Boring work no more // DIY hacks vs. corp services

sbagency
2 min read · Dec 14, 2023


LLM-based solutions are useful and can do many things for consumers and enterprises, but they come with costs.

https://www.youtube.com/watch?v=7gMg98Hf3uM

Non-boring work with docs: DIY + open models

# Let's use open LLMs, HF-inference
!pip install farm-haystack[colab]
!pip install PyPDF2
# HuggingFace Token
from getpass import getpass
HF_TOKEN = getpass("HuggingFace Token")
from haystack.nodes import PreProcessor, PromptTemplate, PromptNode
# Upload doc file
from google.colab import files
files.upload()
# PDF text extraction
import PyPDF2
from haystack import Document

pdf_file_path = "doc.pdf"

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

pdf_text = extract_text_from_pdf(pdf_file_path)
doc = Document(
    content=pdf_text,
    meta={"pdf_path": pdf_file_path},
)
# doc
docs = [doc]
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    language="en",  # match this to your document's language
)
preprocessed_docs = processor.process(docs)
# doc store
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(preprocessed_docs)
# retriever
from haystack import Pipeline
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store, top_k=2)
# prompt templates
prompt_template1 = PromptTemplate(
    prompt="""Using only the information contained in the context,
answer only the question posed, without suggesting possible follow-up questions.
If the answer cannot be inferred from the context, answer: "I don't know, because it is not in the context."
Context: {join(documents)};
Question: {query}
""")
prompt_template2 = PromptTemplate(
    prompt="""Use the following context to generate output.
Context: {join(documents)};
{query}
""")
# models
model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
prompt_node = PromptNode(
    model_name_or_path=model,
    api_key=HF_TOKEN,
    default_prompt_template=prompt_template2,
    max_length=500,
    model_kwargs={"model_max_length": 5000},
)
rag_pipeline = Pipeline()
rag_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
rag_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])
from pprint import pprint
print_answer = lambda out: pprint(out["results"][0].strip())
# query doc
print_answer(rag_pipeline.run(query="What is the core idea of this doc?"))
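
Note that prompt_template1 is defined above but never wired in. To get the stricter "answer only from the context" behavior, bind a second node to it; a minimal sketch reusing the retriever and settings above (the sample query is hypothetical):

# A second node bound to the stricter Q&A template
strict_node = PromptNode(
    model_name_or_path=model,
    api_key=HF_TOKEN,
    default_prompt_template=prompt_template1,
    max_length=500,
    model_kwargs={"model_max_length": 5000},
)
strict_pipeline = Pipeline()
strict_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
strict_pipeline.add_node(component=strict_node, name="prompt_node", inputs=["retriever"])
print_answer(strict_pipeline.run(query="What methodology does the doc describe?"))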

The problem is that this kind of pipeline handles abstraction poorly. If your question closely matches the wording of the document, the retriever surfaces the right passages and you get a relevant response. Abstract or general questions, however, retrieve poorly and won't work well.
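
One DIY mitigation before giving up: swap the keyword-based BM25Retriever for a dense EmbeddingRetriever, so retrieval matches meaning rather than exact wording. A minimal sketch against the same preprocessed docs (the sentence-transformers model and the 384 embedding dimension are assumptions, not from the setup above):

from haystack.nodes import EmbeddingRetriever

# Dense store: embedding_dim must match the embedding model (384 for MiniLM)
dense_store = InMemoryDocumentStore(embedding_dim=384)
dense_store.write_documents(preprocessed_docs)
dense_retriever = EmbeddingRetriever(
    document_store=dense_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    top_k=2,
)
dense_store.update_embeddings(dense_retriever)  # compute vectors for stored docs
semantic_pipeline = Pipeline()
semantic_pipeline.add_node(component=dense_retriever, name="retriever", inputs=["Query"])
semantic_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])
print_answer(semantic_pipeline.run(query="Summarize the main argument of this doc."))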

Alternatives?

Corp services
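
For comparison, the corp-service route collapses most of the pipeline above into a single hosted API call; you trade setup work for per-token costs and vendor lock-in. A minimal sketch against OpenAI's hosted API (the model choice and the context-stuffing approach are illustrative assumptions, not a recommendation):

# Hosted alternative: one API call replaces self-managed retrieval + serving
from openai import OpenAI
client = OpenAI(api_key=getpass("OpenAI API Key"))
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {pdf_text[:4000]}\nQuestion: What is the core idea of this doc?"},
    ],
)
print(response.choices[0].message.content)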
