Enhance table processing flow by shwetha-s-poojary · Pull Request #626 · IBM/project-ai-services

shwetha-s-poojary · 2026-04-15T10:06:55Z

As part of this PR:

Extended the chunking pipeline to include table summaries alongside text summaries (previously only text summaries were chunked).
- Introduced one more intermediate file, *_table_chunk.json, to process chunks.
- Renamed create_chunk_document to merge_chunked_documents.
Improved the ingestion pipeline by exporting tables as Markdown, removing redundant caption handling, and adding minor cleanup and clarifying comments.
Consolidated and simplified the LLM utilities by introducing a single unified summary + classification prompt.
- Increased max_tokens to 1024.
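To illustrate the merge step described above, here is a minimal sketch of how text chunks and table chunks could be combined into one list. It is not the PR's merge_chunked_documents implementation: the helper names `load_chunks` and `merge_chunks`, the `source_type` field, and the assumption that each intermediate file holds a JSON list of chunk dicts are all hypothetical. Only the `page_number` key is taken from the diff in this PR.

```python
import json
from pathlib import Path
from typing import Any

Chunk = dict[str, Any]


def load_chunks(path: Path) -> list[Chunk]:
    """Read a JSON list of chunk dicts; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def merge_chunks(text_chunks: list[Chunk], table_chunks: list[Chunk]) -> list[Chunk]:
    """Combine both lists, tag each chunk's origin, and order by page number."""
    # "source_type" is a hypothetical field added here for illustration only.
    merged = [{**chunk, "source_type": "text"} for chunk in text_chunks]
    merged += [{**chunk, "source_type": "table"} for chunk in table_chunks]
    # sorted() is stable, so chunks on the same page keep their relative order.
    return sorted(merged, key=lambda chunk: chunk.get("page_number", 0))
```

A caller might load a text chunk file and the matching *_table_chunk.json file with `load_chunks` and pass both lists to `merge_chunks`. Tagging the origin keeps table summaries distinguishable from text summaries after the merge.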

manalilatkar

overall LGTM.

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

dharaneeshvrd · 2026-04-16T13:19:36Z

+                        "caption": caption,
+                        "markdown_source": markdown_source,
+                        "page_number": page_number,
+                        "table_index": tab_idx,


this is not used

table_index isn't used right now; I will remove it. The other three fields are used later.

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

shwetha-s-poojary · 2026-04-17T05:01:30Z

      "query_vllm_stream_de": "Sie erhalten: 1. **Einen kurzen Kontexttext** mit sachlichen Informationen.\n2. **Die Frage eines Nutzers**, der um Klärung oder Rat bittet.\n3. **Geben Sie eine prägnante und aussagekräftige Antwort, die sich strikt auf den gegebenen Kontext stützt.**\n\nDie Antwort sollte korrekt, leicht verständlich und kontextbezogen sein sowie eine klare Begründung enthalten.\nWenn der Kontext nicht genügend Informationen liefert, antworten Sie mit Ihrem Allgemeinwissen.\n\nKontext:{context}\n\nFrage:{question}\n\nAntwort:",
-      "llm_classify": "You are an intelligent assistant helping to curate a knowledge base for a Retrieval-Augmented Generation (RAG) system.\nYour task is to decide whether the following text should be included in the knowledge corpus. Respond only with \"yes\" or \"no\".\n\nRespond \"yes\" if the text contains factual, instructional, or explanatory information that could help answer general user questions on any topic.\nRespond \"no\" if the text contains personal, administrative, or irrelevant content, such as names, acknowledgements, contact info, disclaimers, legal notices, or unrelated commentary.\n\nText: {text}\n\nAnswer:",
-      "table_summary": "You are an intelligent assistant analyzing set of documents.\nYou are given a table extracted from a document. Your task is to summarize the key points and insights from the table. Avoid repeating the entire content; focus on what is meaningful or important.\n\nTable:\n{content}\n\nSummary:",
+      "table_summary_and_classify": "You are an intelligent assistant analyzing tables extracted from documents.\n\nYour tasks:\n\n1. Extract and document EVERY piece of information from the table in extensive detail:\n   - List ALL sections, subsections, and their reference numbers if present\n   - Include EVERY specification, feature, number, code, condition, and requirement\n   - Mention ALL items even if they seem minor - nothing should be omitted\n   - Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)\n   - Be extremely thorough and comprehensive - aim for maximum detail\n   - If the table has multiple rows/columns, describe each one\n   - Preserve all technical terms, version numbers, and specific details exactly as shown\n\n2. Decide if the table is relevant for a knowledge base:\n   - Relevant: contains factual, instructional, or explanatory info useful for answering questions.\n   - Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.\n\n3. Output in the exact format below:\n\nSummary: <your extremely detailed summary here - be verbose and comprehensive>\nDecision: <yes or no>\n\nDo NOT output JSON, extra commentary, or any other text.\n\nExamples:\n\nPositive example (relevant):\nTable:\n| Processor | Cores | Memory |\n|-----------|-------|--------|\n| Power10   | 16    | 8 TB   |\n\nOutput:\nSummary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.\nDecision: yes\n\nNegative example (irrelevant):\nTable:\n| Prepared by: | John Smith |\n|--------------|------------|\n\nOutput:\nSummary: Document metadata indicating it was prepared by John Smith.\nDecision: no\n\nNow analyze the table below:\n\nTable:\n{content}",


New prompt:

You are an intelligent assistant analyzing tables extracted from documents.

Your tasks:

1. Extract and document EVERY piece of information from the table in extensive detail:
   - List ALL sections, subsections, and their reference numbers if present
   - Include EVERY specification, feature, number, code, condition, and requirement
   - Mention ALL items even if they seem minor - nothing should be omitted
   - Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)
   - Be extremely thorough and comprehensive - aim for maximum detail
   - If the table has multiple rows/columns, describe each one
   - Preserve all technical terms, version numbers, and specific details exactly as shown

2. Decide if the table is relevant for a knowledge base:
   - Relevant: contains factual, instructional, or explanatory info useful for answering questions.
   - Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.

3. Output in the exact format below:

Summary: <your extremely detailed summary here - be verbose and comprehensive>
Decision: <yes or no>

Do NOT output JSON, extra commentary, or any other text.

Examples:

Positive example (relevant):
Table:
| Processor | Cores | Memory |
|-----------|-------|--------|
| Power10   | 16    | 8 TB   |

Output:
Summary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.
Decision: yes

Negative example (irrelevant):
Table:
| Prepared by: | John Smith |
|--------------|------------|

Output:
Summary: Document metadata indicating it was prepared by John Smith.
Decision: no

Now analyze the table below:

Table:
{content}
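Because the prompt asks for a fixed "Summary: ... / Decision: ..." layout, the reply has to be split back into its two parts. The following is a minimal sketch of such a parser, not the code in llm_utils.py: the names `TableVerdict` and `parse_summary_and_decision` are hypothetical. Falling back to "not relevant" on a malformed reply mirrors the commit in this PR that sets failed classifications to False.

```python
import re
from typing import NamedTuple


class TableVerdict(NamedTuple):
    summary: str
    relevant: bool


# Matches "Summary: <text>" followed by a line starting with "Decision: yes|no".
_RESPONSE_RE = re.compile(
    r"Summary:\s*(?P<summary>.*?)\s*^Decision:\s*(?P<decision>yes|no)\b",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def parse_summary_and_decision(response: str) -> TableVerdict:
    """Parse the model reply; return an empty summary and False if malformed."""
    match = _RESPONSE_RE.search(response)
    if match is None:
        # Covers truncated replies (no Decision line) and unexpected formats.
        return TableVerdict(summary="", relevant=False)
    return TableVerdict(
        summary=match.group("summary").strip(),
        relevant=match.group("decision").lower() == "yes",
    )
```

A reply cut off by the token limit before the Decision line does not match the pattern, so the table is treated as not relevant instead of being indexed with a partial summary.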

Increasing the prompt length may exhaust your max_tokens limit since you still need to include the source text. You also need to assess the impact of this change.

mkumatag · 2026-04-17T13:51:20Z

for better visibility: Increasing the prompt length may exhaust your max_tokens limit since you still need to include the source text. You also need to assess the impact of this change.
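One way to assess the concern raised above is to estimate the token budget before sending a request. The sketch below is illustrative only: it assumes the serving stack counts the prompt and the max_tokens completion budget against one shared context window, and it uses a rough four-characters-per-token heuristic in place of the model's real tokenizer. The function names and the context window value are hypothetical; the thread does not state the model's actual limits.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the ratio is a heuristic, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt_template: str, table_markdown: str,
                 context_window: int, max_tokens: int) -> bool:
    """Check whether the filled prompt plus the completion budget fits the window."""
    # str.replace avoids str.format errors if the table contains braces.
    prompt = prompt_template.replace("{content}", table_markdown)
    return estimate_tokens(prompt) + max_tokens <= context_window
```

For example, `fits_context(template, table_md, context_window=2048, max_tokens=1024)` returns False once the filled prompt is estimated at more than 1024 tokens, which flags tables that should be split or truncated before summarization. For an accurate count, the heuristic would need to be replaced with the tokenizer of the deployed model.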

shwetha-s-poojary requested review from Niharika0306 and dharaneeshvrd April 15, 2026 10:24

manalilatkar requested changes Apr 15, 2026

Comment thread spyre-rag/src/common/llm_utils.py Outdated

Comment thread spyre-rag/src/common/llm_utils.py Outdated

dharaneeshvrd reviewed Apr 15, 2026

shwetha-s-poojary added 7 commits April 16, 2026 07:41

Add table chunking; switch tables from HTML to MD

90a6591

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

Merge summarize and classify prompts and increase max_token

a4ec9f0

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

Set default classify and failed summaries classication to False

ae0f450

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

modify summarize and classify utils and prompt examples to use md

6ce0688

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

Clean and improvise chunking logic

0c85976

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

Make summarize and classify max_tokens configurable

61827a6

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

bump rag images

70bed40

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

shwetha-s-poojary force-pushed the enhance_table_processing branch from 31f9661 to 70bed40 Compare April 16, 2026 11:45

dharaneeshvrd reviewed Apr 16, 2026

refactor parts of chunking logic for better readability

5854ae3

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>

shwetha-s-poojary commented Apr 17, 2026

4 participants