

AI Interview Question - JP Morgan

AI-powered chatbots have rapidly transformed how banks and financial institutions like JP Morgan serve both their employees and clients. Retrieval-Augmented Generation (RAG) models, which combine large language models (LLMs) with enterprise document retrieval, promise to answer complex questions from vast corpora of internal documents. However, when scaling from a proof-of-concept (POC) to production, many teams notice a sharp drop in accuracy. Why does a chatbot that worked flawlessly on 100 documents fail when exposed to 50,000 PDFs, PowerPoints, and policy files? This article explores a real-world AI interview question posed at JP Morgan about troubleshooting RAG chatbot accuracy issues at scale—offering a beginner-friendly, step-by-step breakdown of the root causes and best-practice solutions.

AI Interview Question - JP Morgan: Understanding the Problem


The Interview Scenario

Question: Our RAG chatbot worked great in the POC with 100 documents. But when we scaled to 50,000 docs (PDFs, PPTs, policies) – accuracy dropped to 60%. Users got incomplete answers. What went wrong and how do you fix it?

Why Is This a Common RAG Failure?

RAG (Retrieval-Augmented Generation) chatbots depend on two main pillars: robust retrieval of relevant content and accurate generation of answers. At small scale, both retrieval and generation work well. But as the corpus grows, naïve approaches to document ingestion, chunking, and search struggle:

  • Chunks lose context: Important information gets split across arbitrary chunks.
  • Heterogeneous documents: Different formats (PDFs, slides, tables) are not handled uniformly.
  • Redundant/near-duplicate content: Multiple versions confuse the retriever and LLM.

Let’s break down the root causes and build an actionable, step-by-step guide to scaling RAG chatbots for enterprises like JP Morgan.

1. The Foundation: How RAG Chatbots Work


What Is Retrieval-Augmented Generation (RAG)?

RAG chatbots enhance LLMs by allowing them to retrieve facts from external sources (your company’s documents, policies, and FAQs). Here’s a simplified workflow:

  1. User asks a question (e.g., “What is the premium savings account rate?”).
  2. System retrieves top-k relevant chunks of text from the document corpus using similarity search.
  3. LLM reads those chunks and generates an answer.

This hybrid approach ensures answers are grounded in your company’s actual data—not just the LLM’s training distribution.
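
In code, this loop is just retrieve-then-generate. Below is a minimal, illustrative sketch: vector_search and llm_answer are placeholder helpers standing in for whatever search index and LLM client you actually use.

# Example: minimal retrieve-then-generate loop (placeholder helpers)
def answer_question(question, top_k=5):
    # 1. Retrieve the most relevant chunks from the document index
    chunks = vector_search(question, top_k=top_k)          # placeholder retriever
    context = "\n\n".join(chunk['text'] for chunk in chunks)
    # 2. Ground the LLM in the retrieved context
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_answer(prompt)                               # placeholder LLM call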

Why Is Scaling So Hard?

When you scale from a few hundred to tens of thousands of documents, several technical and organizational challenges emerge:

  • Document heterogeneity: PDFs, scanned images, PowerPoints, tables, etc.
  • Out-of-date or duplicate content: old policies living alongside new ones.
  • Fragile chunking: simple fixed-size strategies break down, causing loss of context.
  • Noisier retrieval: search quality suffers as the index fills with more, and more varied, content.

Example: The Chunking Problem

Imagine a banking policy document where:

  • Page 1 says: Savings Account rate: 3.5%
  • Page 29 says: Premium savings rate: 4.2%

If you chunk every 512 tokens, these related facts can easily land in separate chunks, and the retriever may surface the standard rate when the user asked about the premium one.
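
To make the failure concrete, here is a hedged sketch of that naive splitter; word count stands in for tokens so the example stays dependency-free.

# Example: naive fixed-size chunking (words stand in for tokens)
def naive_chunk(text, chunk_size=512):
    # Splits purely on a fixed budget; headings, tables, and related facts
    # that straddle a boundary end up in different chunks.
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]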

2. Diagnosing the Real Problem: Why Did Accuracy Drop?


Why Naive Data Pipelines Fail at Scale

The most common culprit is a data pipeline that isn’t “RAG-ready.” Let’s explore the major pitfalls:

Naive Chunking: Losing Context

Most teams start by splitting documents into fixed-size chunks (e.g., every 512 tokens). This is fast, but it ignores logical and semantic boundaries. The result:

  • Paragraphs, sentences, or even tables are split mid-way.
  • Context necessary to answer a user’s question is lost.
  • Chunks retrieved may be incomplete or misleading.

Heterogeneous Document Formats

Enterprise content comes in all shapes and sizes:

  • PDFs (scanned, digital-native, mixed content)
  • PowerPoint slides (bulleted, visual)
  • Word documents
  • Tables and structured data
  • Legal and compliance policies

Naive pipelines treat all formats equally, leading to poor extraction and retrieval.

Redundant and Near-Duplicate Content

Over time, companies accumulate multiple versions of the same policy. If not deduplicated, the retriever may return outdated or conflicting information, causing confusion for both the LLM and the user.

| Issue | Impact on RAG Chatbot |
|---|---|
| Naive chunking | Breaks context, incomplete answers |
| Format heterogeneity | Extraction errors, missing data |
| Redundant content | Confusing, conflicting answers |

What’s the Solution? A Two-Step Fix

To address these failures, industry leaders recommend a two-step approach:

  1. Smart Pre-processing: Make your data RAG-ready before it enters the index.
  2. Dual Retrieval: Combine keyword and semantic search to maximize relevant chunk recall.

3. Step 1 – Smart Pre-Processing: Making Data RAG-Ready


Why Pre-Processing Matters

Pre-processing transforms raw, messy documents into structured, context-rich chunks that are easy for both the retriever and LLM to use.

Context-Aware Chunking

Instead of splitting every N tokens, use logic that respects document structure. This can be achieved through:

  • Table of contents parsing
  • Heading detection (e.g., “Savings Account Policy”, “Interest Rates”)
  • LLM-assisted segmentation: Use an LLM to identify logical sections

Example: Treat the entire “Savings Account Policy” section as one semantic chunk, even if it’s 1500 tokens. Downstream, you can break it into smaller but context-preserving sub-chunks if needed.


# Example: LLM-assisted chunking pseudo-code
def context_aware_chunk(document_text):
    sections = llm_extract_sections(document_text)
    chunks = []
    for section in sections:
        heading, section_text = section
        # Further split if over max_tokens, but preserve full paragraphs
        sub_chunks = split_on_paragraphs(section_text, max_tokens=512)
        for chunk in sub_chunks:
            chunks.append({'heading': heading, 'text': chunk})
    return chunks

Search Metadata Generation

For each chunk, extract:

  • Key topics (e.g., “savings account interest rate”)
  • Likely user questions (e.g., “What’s the current rate?”)
  • Document source, version, and update date

Store this metadata alongside each chunk in your retrieval index. It improves both recall and precision during search.


# Example: Metadata extraction for a chunk
chunk = "Section: Savings Account Policy. The current rate is 3.5% as of Jan 2024."
metadata = {
    'topics': ['savings account', 'interest rate'],
    'sample_questions': ['What is the savings account interest rate?'],
    'source': 'savings_policy_v3.pdf',
    'date': '2024-01-15'
}

Converting Complex Docs to Q&A

Legal and compliance documents are notoriously dense. The best practice is to convert procedures and rules into FAQ-style Q&A pairs, either via LLM or with a human-in-the-loop for accuracy.

  • Human-in-the-loop: Compliance officers verify LLM-generated Q&A pairs.
  • LLM automation: Auto-generate FAQs for each section.

# Example: LLM converting policy to FAQ
prompt = f"Convert the following policy to a list of FAQs:\n{policy_section}"
faqs = llm_generate_faqs(prompt)

Deduplication and Canonicalization

Deduplicate redundant content and canonicalize versions by the steps below (a minimal sketch follows the list):

  • Hashing chunk contents
  • Comparing metadata fields
  • Retaining only the latest version
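
A minimal sketch of that dedup step, assuming each chunk is a dict carrying 'text' and an ISO-format 'date' field (the field names are illustrative):

# Example: hash-based deduplication, keeping the newest copy
import hashlib

def deduplicate_chunks(chunks):
    canonical = {}
    for chunk in chunks:
        # Normalize lightly so trivially different copies hash the same
        key = hashlib.sha256(chunk['text'].strip().lower().encode()).hexdigest()
        # ISO dates compare correctly as strings
        if key not in canonical or chunk['date'] > canonical[key]['date']:
            canonical[key] = chunk
    return list(canonical.values())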

Handling Tables and Slides

Convert tables into structured data or text summaries; extract slide bullets and images as text. LLMs can help summarize complex visuals into retrievable text.

Summary Table: RAG-Ready Pre-Processing Steps

| Step | Purpose | Impact |
|---|---|---|
| Context-aware chunking | Preserve logical sections, avoid splitting context | Full, accurate answers |
| Metadata generation | Enhance retrievability and filtering | Higher precision in search |
| Q&A conversion | Make dense docs user-friendly | Improved clarity for LLM |
| Deduplication | Remove outdated/conflicting info | Consistent, up-to-date answers |

4. Step 2 – Dual Retrieval: Maximizing Recall and Precision


Why Not Just Use One Search Method?

RAG systems typically rely on vector search (e.g., ANN/HNSW) to find semantically similar chunks. But this can miss exact terms (e.g., policy IDs, legal terms). Classic keyword search (BM25) excels at catching these, but struggles with paraphrases or synonyms.

Dual Retrieval: The Best of Both Worlds

Combine both approaches:

  • BM25 (keyword): Finds chunks with exact term matches.
  • Vector search (e.g., HNSW): Finds semantically similar or paraphrased content.

Retrieve top-k results from both, then merge, re-rank, or cross-check as needed.
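
One common way to merge the two ranked lists is reciprocal rank fusion (RRF). The sketch below is illustrative and assumes every result carries a stable 'id' field.

# Example: merging BM25 and vector results with reciprocal rank fusion
def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    scores, by_id = {}, {}
    for results in (bm25_results, vector_results):
        for rank, chunk in enumerate(results, start=1):
            by_id[chunk['id']] = chunk
            # A chunk found by both retrievers accumulates score from each list
            scores[chunk['id']] = scores.get(chunk['id'], 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[chunk_id] for chunk_id in ranked]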

Query Normalization

User queries are often informal (“What's the rate on my savings?”). Normalize these into canonical banking terms using either a rules-based approach or an LLM.


# Example: Query normalization
user_query = "What's the rate on my savings?"
normalized_query = llm_normalize_query(user_query)
# Output: 'savings account interest rate policy'

Parallel Retrieval Agents

  • Agent 1: Uses vector search to generate a draft answer from top-k semantic chunks.
  • Agent 2: Refines the answer and cross-checks against top BM25 (keyword) results to fill in gaps or correct errors.

This “checks and balances” approach catches both paraphrased and literal information.

The Retrieval Pipeline in Practice


# Dual retrieval pipeline pseudo-code
def dual_retrieve_and_generate(user_query):
    normalized_query = llm_normalize_query(user_query)
    bm25_chunks = bm25_search(normalized_query, top_k=5)
    vector_chunks = vector_search(normalized_query, top_k=5)
    retrieved_chunks = merge_and_rerank(bm25_chunks, vector_chunks)
    draft_answer = llm_generate_answer(retrieved_chunks)
    refined_answer = llm_refine_with_bm25(draft_answer, bm25_chunks)
    return refined_answer

Equations: Measuring Retrieval Quality

Let’s define retrieval precision and recall:

$$ \text{Precision} = \frac{\text{Relevant Items Retrieved}}{\text{Total Items Retrieved}} $$

$$ \text{Recall} = \frac{\text{Relevant Items Retrieved}}{\text{Total Relevant Items in Corpus}} $$

Dual retrieval aims to maximize both metrics by combining methods.
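
As a quick sanity check, here are the same metrics in code, given the IDs of retrieved chunks and the (human-labelled) relevant chunks:

# Example: computing precision and recall for one query
def precision_recall(retrieved_ids, relevant_ids):
    hits = len(retrieved_ids & relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# 5 chunks retrieved, 3 of them relevant, 4 relevant chunks exist in total
print(precision_recall({'c1', 'c2', 'c3', 'c4', 'c5'}, {'c1', 'c2', 'c3', 'c9'}))
# -> (0.6, 0.75)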

Summary Table: Dual Retrieval Components

| Component | Role | Benefits |
|---|---|---|
| BM25 keyword search | Finds exact terms, IDs, policies | High precision for literal queries |
| HNSW/ANN vector search | Finds semantic matches, paraphrases | Better recall, natural language support |
| Query normalization | Maps user language to domain terms | Improves both recall and precision |
| Agent refinement | Cross-checks, fills in gaps | More complete, accurate answers |

 

5. Real-World Implementation: Building a Scalable RAG Pipeline for JP Morgan


Putting It All Together: End-to-End Workflow

Let’s walk through how a global bank like JP Morgan can implement a scalable, accurate RAG chatbot using the best practices described above. This section will connect the theory to practical steps, so your AI system can move from POC to production without losing accuracy.

Step-by-Step Workflow Overview

  1. Document Ingestion: Collect all documents (PDFs, slides, tables, policies) from internal repositories.
  2. Pre-Processing: Apply context-aware chunking, metadata extraction, and Q&A generation for every document.
  3. Deduplication: Remove outdated or duplicate content, keeping only the canonical versions.
  4. Indexing: Store processed chunks and metadata in a hybrid search index (BM25 + vector DB).
  5. Query Handling: Normalize user queries to banking terminology using an LLM or custom mappings.
  6. Dual Retrieval: Fetch top results from both keyword and semantic search.
  7. Answer Generation/Refinement: Let LLMs generate draft answers and then refine them by cross-referencing all retrieved chunks.
  8. Human-in-the-Loop Feedback (optional): For mission-critical domains like compliance, add a review workflow for high-risk queries.

Sample Architecture Diagram

While we can’t use images here, below is a textual representation of a robust RAG architecture:


[Document Sources] 
      |
      v
[Pre-Processing Pipeline: Chunking, Metadata, Q&A, Deduplication]
      |
      v
[Hybrid Search Index: BM25 + Vector DB (e.g., Elasticsearch + Pinecone)]
      |
      v
[Query Normalization]
      |
      v
[Dual Retrieval (BM25 & Vector)]
      |
      v
[LLM Draft Answer Generation]
      |
      v
[LLM Refinement with Cross-Checks]
      |
      v
[User or Human-in-the-Loop Reviewer]
      |
      v
[Final Answer]

Choosing Tools & Technologies

  • Pre-processing: Python (NLTK, spaCy), OpenAI/GPT APIs for chunking and Q&A, PyPDF2, python-docx, etc.
  • Deduplication: hashlib, MinHash, custom logic for versioning.
  • Vector DB: Pinecone, Weaviate, FAISS, Qdrant
  • Keyword Search: Elasticsearch, OpenSearch, Solr
  • LLM Generation: OpenAI GPT-4, Azure OpenAI, Cohere, etc.

Example: Context-Aware Chunking for Banking Policy Documents

Suppose you have a 100-page PDF “Deposit Accounts Policy.” Here’s how you can chunk it semantically:


import re
def extract_sections(text):
    # Simple regex to find headings, e.g., "1.1 Savings Accounts"
    pattern = r'(\d+\.\d+\s+[A-Za-z ]+)'
    sections = re.split(pattern, text)
    # Pair headings with their section bodies
    paired = []
    for i in range(1, len(sections), 2):
        heading = sections[i].strip()
        body = sections[i+1].strip()
        paired.append((heading, body))
    return paired

# Usage (assumes the PDF text has already been extracted to plain text)
with open('deposit_policy.txt') as f:
    doc_text = f.read()
sections = extract_sections(doc_text)
for heading, section_text in sections:
    # Further processing, metadata extraction, etc.
    print(f"Section: {heading}\n{section_text}\n---")

Example: Metadata Enrichment for a Chunk


chunk = "Section 1.1 Savings Accounts: The interest rate is 3.5% per annum as of Jan 2024."
metadata = {
    'topics': ['savings accounts', 'interest rate'],
    'sample_questions': ['What is the interest rate for savings accounts?', 'Savings account annual rate?'],
    'source': 'deposit_policy_2024.pdf',
    'date': '2024-01-05'
}
# Add to index: (chunk, metadata)

Example: Dual Retrieval Implementation


def retrieve_chunks(query):
    normalized = llm_normalize_query(query)
    bm25_results = bm25_search(normalized, k=5)
    vector_results = vector_search(normalized, k=5)
    # Merge and deduplicate
    all_results = {chunk['id']: chunk for chunk in bm25_results + vector_results}.values()
    # Optionally rerank or prioritize BM25 for policy/legal queries
    return list(all_results)

Handling Multimodal Documents

Financial documents often include tables, charts, and even images. To make these RAG-ready (see the extraction sketch after this list):

  • Tables: Use table extraction libraries (e.g., tabula-py, camelot) to convert tables into structured text or CSVs. LLMs can then summarize or rephrase these into FAQ pairs.
  • Slides: Extract text from PPTs using python-pptx, then chunk by slide or topic. For images, use OCR (e.g., Tesseract) and summarize the content.
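
The sketch below shows one way to do this with camelot (PDF tables) and python-pptx (slides); the library choices and file handling are illustrative, and OCR for embedded images is left as a separate step.

# Example: extracting tables and slide text for downstream chunking
import camelot                      # PDF table extraction
from pptx import Presentation       # PowerPoint text extraction

def extract_pdf_tables(path):
    # Each table becomes a CSV string an LLM can later summarize or turn into FAQs
    tables = camelot.read_pdf(path, pages='all')
    return [table.df.to_csv(index=False) for table in tables]

def extract_slide_text(path):
    # One text chunk per slide; embedded images would additionally need OCR
    prs = Presentation(path)
    return [
        '\n'.join(shape.text_frame.text for shape in slide.shapes if shape.has_text_frame)
        for slide in prs.slides
    ]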

Version Control and Canonicalization

To avoid confusion from outdated or duplicate documents (a retrieval-time sketch follows this list):

  • Add version/date metadata to each chunk.
  • During retrieval or answer generation, prefer the most recent version.
  • For conflicting answers, let the LLM cite the latest policy or escalate for human review.
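
A retrieval-time sketch of that preference, assuming each chunk's metadata carries a 'doc_id' and an ISO 'date' (field names are illustrative):

# Example: keep only the newest version of each retrieved document
def prefer_latest(retrieved_chunks):
    newest = {}
    for chunk in retrieved_chunks:
        doc = chunk['doc_id']
        if doc not in newest or chunk['date'] > newest[doc]['date']:
            newest[doc] = chunk
    return list(newest.values())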

End-to-End Example: User Query to Final Answer

  1. User asks: “What’s the current premium savings rate?”
  2. Query normalization: “premium savings account interest rate policy”
  3. BM25 retrieves chunks mentioning “premium savings rate”
  4. Vector search finds semantically similar chunks (e.g., “interest on premium accounts”)
  5. LLM drafts answer using all retrieved information: “As of Jan 2024, the premium savings account rate is 4.2%.”
  6. LLM refines answer, cites document source and date

Measuring Success: KPIs for RAG Chatbots

| Metric | Description | Target |
|---|---|---|
| Retrieval Precision | Fraction of retrieved chunks that are relevant | >90% for critical policies |
| Retrieval Recall | Fraction of all relevant chunks that are retrieved | >95% |
| Answer Accuracy | Fraction of user queries answered correctly (human-judged) | >90% |
| Latency | Time from query to answer | <2 seconds for most queries |

6. Advanced Enhancements for Enterprise-Grade RAG


Human-in-the-Loop (HITL) Review

For highly sensitive domains like compliance or legal, integrate a workflow where certain answers are flagged for human review before being shown to the end user. This can be based on confidence scores, query type, or policy triggers.
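
A minimal routing sketch: the 0.8 threshold, the query-type labels, and send_to_review_queue are all illustrative assumptions, not a prescribed implementation.

# Example: flagging answers for human review
HIGH_RISK_TYPES = {'compliance', 'legal', 'regulatory'}

def route_answer(answer, confidence, query_type):
    # Low-confidence or high-risk answers go to a reviewer before release
    if confidence < 0.8 or query_type in HIGH_RISK_TYPES:
        return send_to_review_queue(answer)   # placeholder review workflow
    return answer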

Feedback Loops and Continuous Learning

  • Log user queries and answers, especially those marked as incomplete or incorrect.
  • Periodically retrain or fine-tune your LLM or retriever based on real feedback.
  • Update chunking and metadata extraction models as document templates evolve.

Personalization and Access Control

Not every user should see every document. Use metadata to implement access controls, ensuring users only retrieve content they’re authorized to view (e.g., retail vs. corporate banking policies).
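
In practice this can be a simple metadata filter applied before (or inside) retrieval; the 'allowed_groups' field name below is an assumption.

# Example: entitlement filtering on chunk metadata
def filter_by_entitlements(chunks, user_groups):
    # Keep only chunks whose allowed groups overlap the user's entitlements
    return [c for c in chunks if set(c.get('allowed_groups', [])) & user_groups]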

Multilingual Support

For global banks, support queries and content in multiple languages. Use LLMs or translation APIs for both chunking and retrieval, and store language metadata for each chunk.

Explainability and Source Attribution

Always cite the source document, section, and date in generated answers. This builds trust and allows users to verify information independently.


# Example: Adding citations to answers
answer = "As of Jan 2024, the premium savings account interest rate is 4.2%."
citation = "(Source: premium_policy_v2.pdf, Section 3.1, 2024-01-05)"
full_answer = f"{answer}\n{citation}"

Monitoring and Alerting

Implement dashboards to monitor retrieval accuracy, answer quality, and system latency. Set up alerts for sudden drops in accuracy or spikes in incomplete answers, enabling rapid diagnosis and mitigation.
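
As one illustration, an alert rule might compare today's measured answer accuracy against a trailing baseline; the window and threshold below are arbitrary examples.

# Example: simple accuracy-drop alert
def accuracy_dropped(daily_accuracy, threshold=0.05):
    # Compare the latest day against the trailing 7-day average
    if len(daily_accuracy) < 8:
        return False
    baseline = sum(daily_accuracy[-8:-1]) / 7
    return baseline - daily_accuracy[-1] > threshold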

7. Common Pitfalls and How to Avoid Them


Summary of “What Went Wrong” in Scaling RAG

  • Chunking by fixed tokens splits logical sections—use context-aware chunking!
  • Ignoring document heterogeneity—custom parse and chunk each format type.
  • No deduplication—conflicting answers cause user frustration.
  • Relying only on vector search—misses literal policy terms.
  • No query normalization—user language doesn’t match enterprise terms.

Best Practices Checklist

  • Always pre-process documents for structure and metadata.
  • Use hybrid (BM25 + vector) retrieval, not just one or the other.
  • Continuously monitor and evaluate both system and end-user performance.
  • Involve subject matter experts in Q&A generation for compliance and legal docs.
  • Scale human review for high-risk answers.

8. Conclusion: Future-Proofing AI Chatbots at JP Morgan and Beyond


Scaling RAG chatbots from POC to production is a multi-stage, multidisciplinary challenge. The drop from near-perfect accuracy on 100 documents to 60% on 50,000 documents is not a failure of AI, but a signal that your data pipeline and retrieval strategy need to mature. By investing in context-aware pre-processing, robust metadata extraction, dual retrieval, and continuous monitoring, you can build enterprise-grade AI assistants that delight users and withstand the complexity of real-world banking documentation.

Whether you’re preparing for an AI interview at JP Morgan, architecting the next-gen customer support bot, or troubleshooting an underperforming RAG system, remember: the secret to accuracy at scale lies in the details of your data pipeline and retrieval logic. Make your chatbot truly RAG-ready, and you’ll unlock the full power of AI for your organization.

Key Takeaways

  • Context matters: Never split your documents blindly—preserve logical sections and context.
  • Hybrid retrieval: Combine keyword and semantic search for best results.
  • Metadata is gold: Use it for filtering, ranking, and access control.
  • Iterate and monitor: RAG performance improves with real-world feedback and continuous refinement.


Sample Interview Answer (Summary)

“The drop in accuracy is likely due to a naive data pipeline that wasn’t RAG-ready. Simple token-based chunking splits context, while diverse document formats and redundant content further confuse retrieval. To fix this at scale, I’d implement context-aware chunking (potentially with LLMs), extract rich metadata for each chunk, and convert dense docs to Q&A format. On the retrieval side, I’d use dual retrieval—combining BM25 keyword search and vector/ANN semantic search—along with query normalization and LLM refinement to maximize answer completeness and precision. Continuous monitoring and human-in-the-loop review further ensure enterprise-grade quality.”

By understanding and applying these principles, you’ll be well-equipped to design, troubleshoot, and scale AI-powered RAG chatbots for complex organizations like JP Morgan.
