Open
manalilatkar
requested changes
Apr 15, 2026
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
31f9661 to 70bed40
"caption": caption,
"markdown_source": markdown_source,
"page_number": page_number,
"table_index": tab_idx,
Member
Author
table_index isn't used right now, so I will remove it. The other three are used later.
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
"query_vllm_stream_de": "Sie erhalten: 1. **Einen kurzen Kontexttext** mit sachlichen Informationen.\n2. **Die Frage eines Nutzers**, der um Klärung oder Rat bittet.\n3. **Geben Sie eine prägnante und aussagekräftige Antwort, die sich strikt auf den gegebenen Kontext stützt.**\n\nDie Antwort sollte korrekt, leicht verständlich und kontextbezogen sein sowie eine klare Begründung enthalten.\nWenn der Kontext nicht genügend Informationen liefert, antworten Sie mit Ihrem Allgemeinwissen.\n\nKontext:{context}\n\nFrage:{question}\n\nAntwort:",
"llm_classify": "You are an intelligent assistant helping to curate a knowledge base for a Retrieval-Augmented Generation (RAG) system.\nYour task is to decide whether the following text should be included in the knowledge corpus. Respond only with \"yes\" or \"no\".\n\nRespond \"yes\" if the text contains factual, instructional, or explanatory information that could help answer general user questions on any topic.\nRespond \"no\" if the text contains personal, administrative, or irrelevant content, such as names, acknowledgements, contact info, disclaimers, legal notices, or unrelated commentary.\n\nText: {text}\n\nAnswer:",
"table_summary": "You are an intelligent assistant analyzing set of documents.\nYou are given a table extracted from a document. Your task is to summarize the key points and insights from the table. Avoid repeating the entire content; focus on what is meaningful or important.\n\nTable:\n{content}\n\nSummary:",
"table_summary_and_classify": "You are an intelligent assistant analyzing tables extracted from documents.\n\nYour tasks:\n\n1. Extract and document EVERY piece of information from the table in extensive detail:\n - List ALL sections, subsections, and their reference numbers if present\n - Include EVERY specification, feature, number, code, condition, and requirement\n - Mention ALL items even if they seem minor - nothing should be omitted\n - Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)\n - Be extremely thorough and comprehensive - aim for maximum detail\n - If the table has multiple rows/columns, describe each one\n - Preserve all technical terms, version numbers, and specific details exactly as shown\n\n2. Decide if the table is relevant for a knowledge base:\n - Relevant: contains factual, instructional, or explanatory info useful for answering questions.\n - Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.\n\n3. Output in the exact format below:\n\nSummary: <your extremely detailed summary here - be verbose and comprehensive>\nDecision: <yes or no>\n\nDo NOT output JSON, extra commentary, or any other text.\n\nExamples:\n\nPositive example (relevant):\nTable:\n| Processor | Cores | Memory |\n|-----------|-------|--------|\n| Power10 | 16 | 8 TB |\n\nOutput:\nSummary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.\nDecision: yes\n\nNegative example (irrelevant):\nTable:\n| Prepared by: | John Smith |\n|--------------|------------|\n\nOutput:\nSummary: Document metadata indicating it was prepared by John Smith.\nDecision: no\n\nNow analyze the table below:\n\nTable:\n{content}",
Member
Author
New prompt:
You are an intelligent assistant analyzing tables extracted from documents.
Your tasks:
1. Extract and document EVERY piece of information from the table in extensive detail:
- List ALL sections, subsections, and their reference numbers if present
- Include EVERY specification, feature, number, code, condition, and requirement
- Mention ALL items even if they seem minor - nothing should be omitted
- Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)
- Be extremely thorough and comprehensive - aim for maximum detail
- If the table has multiple rows/columns, describe each one
- Preserve all technical terms, version numbers, and specific details exactly as shown
2. Decide if the table is relevant for a knowledge base:
- Relevant: contains factual, instructional, or explanatory info useful for answering questions.
- Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.
3. Output in the exact format below:
Summary: <your extremely detailed summary here - be verbose and comprehensive>
Decision: <yes or no>
Do NOT output JSON, extra commentary, or any other text.
Examples:
Positive example (relevant):
Table:
| Processor | Cores | Memory |
|-----------|-------|--------|
| Power10 | 16 | 8 TB |
Output:
Summary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.
Decision: yes
Negative example (irrelevant):
Table:
| Prepared by: | John Smith |
|--------------|------------|
Output:
Summary: Document metadata indicating it was prepared by John Smith.
Decision: no
Now analyze the table below:
Table:
{content}
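Since the prompt asks for a fixed "Summary: ... / Decision: ..." layout, the caller has to split the reply back into those two parts. Below is a minimal sketch of one way to do that. The names `TableVerdict` and `parse_table_response` are hypothetical and not taken from this PR, and the fallback behaviour for malformed replies is an assumption.

```python
import re
from typing import NamedTuple, Optional


class TableVerdict(NamedTuple):
    summary: str
    keep: Optional[bool]  # None when no usable Decision line was found


# Hypothetical patterns for the "Summary: ... / Decision: yes|no" layout.
_SUMMARY_RE = re.compile(
    r"Summary:\s*(.*?)(?=^\s*Decision:|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)
_DECISION_RE = re.compile(
    r"^\s*Decision:\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE
)


def parse_table_response(text: str) -> TableVerdict:
    """Split a model reply into its summary text and yes/no decision."""
    summary_match = _SUMMARY_RE.search(text)
    decision_match = _DECISION_RE.search(text)
    # Assumption: a reply without a Summary label yields an empty summary.
    summary = summary_match.group(1).strip() if summary_match else ""
    keep = None
    if decision_match:
        keep = decision_match.group(1).lower() == "yes"
    return TableVerdict(summary, keep)


verdict = parse_table_response(
    "Summary: Document metadata indicating it was prepared by John Smith.\n"
    "Decision: no"
)
print(verdict.summary, verdict.keep)
```

Returning `None` for the decision rather than guessing lets the caller choose whether to drop, keep, or retry a table whose reply did not follow the format.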
Member
Increasing the prompt length may exhaust your max_tokens limit since you still need to include the source text. You also need to assess the impact of this change.
Member
For better visibility: increasing the prompt length may exhaust your max_tokens limit, since you still need to include the source text. You also need to assess the impact of this change.
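One rough way to put numbers on this concern is to compare how many tokens each template leaves for the table itself. The sketch below is only illustrative: `estimate_tokens` and `table_budget` are hypothetical helpers, the 4-characters-per-token ratio is a crude assumption (the served model's tokenizer should be used for real measurements), and the window and output sizes are made-up values.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; an assumption, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def table_budget(template: str, context_window: int, reserved_for_output: int) -> int:
    """Estimated tokens left for the table after fixed prompt text and output reserve."""
    fixed = estimate_tokens(template.replace("{content}", ""))
    return context_window - reserved_for_output - fixed


# Stand-in templates: a short prompt versus a much longer instruction block.
short_template = "Summarize the table.\n\nTable:\n{content}\n\nSummary:"
long_template = "Detailed instruction line.\n" * 100 + "\nTable:\n{content}"

# Hypothetical limits; substitute the deployment's actual values.
CONTEXT_WINDOW = 4096
RESERVED_FOR_OUTPUT = 512

for name, template in [("short", short_template), ("long", long_template)]:
    left = table_budget(template, CONTEXT_WINDOW, RESERVED_FOR_OUTPUT)
    print(f"{name}: about {left} tokens left for the table")
```

A negative or very small budget for typical tables would indicate that the longer prompt needs trimming, or that large tables need to be split before summarization.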
As part of this PR:
Extended the chunking pipeline to include table summaries alongside text summaries (previously only text summaries were chunked).
Improved the ingestion pipeline by exporting tables as Markdown, removing redundant caption handling, and adding minor cleanup and clarifying comments.
Consolidated and simplified the LLM utilities by introducing a single unified summary + classification prompt.
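To illustrate the first two points, here is a minimal sketch of rendering a table as Markdown and feeding table summaries into the same list as text summaries before chunking. It is not the code from this PR: `table_to_markdown`, `collect_chunk_inputs`, and the `keep`/`summary` record fields are hypothetical names chosen for the example; only `page_number` appears in the diff above.

```python
from typing import Dict, List


def table_to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def collect_chunk_inputs(text_summaries: List[str], table_records: List[Dict]) -> List[Dict]:
    """Merge text summaries and relevant table summaries into one list for the chunker."""
    items = [{"kind": "text", "content": s} for s in text_summaries]
    for rec in table_records:
        # Assumption: "keep" holds the yes/no decision from the unified prompt.
        if rec.get("keep"):
            items.append(
                {
                    "kind": "table",
                    "content": rec["summary"],
                    "page_number": rec.get("page_number"),
                }
            )
    return items


markdown_source = table_to_markdown(
    ["Processor", "Cores", "Memory"], [["Power10", "16", "8 TB"]]
)
print(markdown_source)

chunk_inputs = collect_chunk_inputs(
    ["Overview of the system architecture."],
    [
        {"summary": "Power10 specifications.", "keep": True, "page_number": 3},
        {"summary": "Prepared by John Smith.", "keep": False, "page_number": 1},
    ],
)
print([item["kind"] for item in chunk_inputs])
```

Tagging each item with its kind keeps text and table summaries distinguishable downstream while letting a single chunking step handle both.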