Boring work no more // DIY hacks vs. corp services

sbagency
2 min read · Dec 14, 2023


LLM-based solutions are useful and can do many things for consumers and enterprises, but they come with costs.

https://www.youtube.com/watch?v=7gMg98Hf3uM

Non-boring work with docs: DIY + open models

# Let's use open LLMs, HF-inference
!pip install farm-haystack[colab]
!pip install PyPDF2
# HuggingFace Token
from getpass import getpass
HF_TOKEN = getpass("HuggingFace Token")
from haystack.nodes import PreProcessor, PromptTemplate, PromptNode
# Upload doc file
from google.colab import files
files.upload()
# PDF text extraction
import PyPDF2
from haystack import Document

pdf_file_path = "doc.pdf"

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

pdf_text = extract_text_from_pdf(pdf_file_path)
doc = Document(
    content=pdf_text,
    meta={"pdf_path": pdf_file_path},
)
# doc
docs = [doc]
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    language="en",  # match this to your document's language
)
preprocessed_docs = processor.process(docs)
# doc store
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(preprocessed_docs)
# retriever
from haystack import Pipeline
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store, top_k=2)
# prompt templates
prompt_template1 = PromptTemplate(
    prompt="""Using only the information contained in the context,
answer only the question posed, without suggesting possible follow-up questions.
If the answer cannot be inferred from the context, answer: "I don't know, because it is not in the context."
Context: {join(documents)};
Question: {query}
""")
prompt_template2 = PromptTemplate(
    prompt="""Use the following context to generate output.
Context: {join(documents)};
{query}
""")
# models
model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
prompt_node = PromptNode(
    model_name_or_path=model,
    api_key=HF_TOKEN,
    default_prompt_template=prompt_template2,
    max_length=500,
    model_kwargs={"model_max_length": 5000},
)
rag_pipeline = Pipeline()
rag_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
rag_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])
from pprint import pprint
print_answer = lambda out: pprint(out["results"][0].strip())
# query doc
print_answer(rag_pipeline.run(query="What is the core idea of this doc?"))
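
Note that prompt_template1 is defined above but never wired in. To get the stricter "answer only from the context" behavior, bind a second node to it; a minimal sketch reusing the retriever and settings above (the sample query is hypothetical):

# A second node bound to the stricter Q&A template
strict_node = PromptNode(
    model_name_or_path=model,
    api_key=HF_TOKEN,
    default_prompt_template=prompt_template1,
    max_length=500,
    model_kwargs={"model_max_length": 5000},
)
strict_pipeline = Pipeline()
strict_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
strict_pipeline.add_node(component=strict_node, name="prompt_node", inputs=["retriever"])
print_answer(strict_pipeline.run(query="What methodology does the doc describe?"))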

The problem is that this kind of pipeline handles abstraction poorly. If your question closely matches the wording of the document, the retriever surfaces the right passages and you get a relevant response. Abstract or general questions, however, retrieve poorly and won't work well.
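
One DIY mitigation before giving up: swap the keyword-based BM25Retriever for a dense EmbeddingRetriever, so retrieval matches meaning rather than exact wording. A minimal sketch against the same preprocessed docs (the sentence-transformers model and the 384 embedding dimension are assumptions, not from the setup above):

from haystack.nodes import EmbeddingRetriever

# Dense store: embedding_dim must match the embedding model (384 for MiniLM)
dense_store = InMemoryDocumentStore(embedding_dim=384)
dense_store.write_documents(preprocessed_docs)
dense_retriever = EmbeddingRetriever(
    document_store=dense_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    top_k=2,
)
dense_store.update_embeddings(dense_retriever)  # compute vectors for stored docs
semantic_pipeline = Pipeline()
semantic_pipeline.add_node(component=dense_retriever, name="retriever", inputs=["Query"])
semantic_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])
print_answer(semantic_pipeline.run(query="Summarize the main argument of this doc."))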

Alternatives?

Corp services
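
For comparison, the corp-service route collapses most of the pipeline above into a single hosted API call; you trade setup work for per-token costs and vendor lock-in. A minimal sketch against OpenAI's hosted API (the model choice and the context-stuffing approach are illustrative assumptions, not a recommendation):

# Hosted alternative: one API call replaces self-managed retrieval + serving
from openai import OpenAI
client = OpenAI(api_key=getpass("OpenAI API Key"))
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {pdf_text[:4000]}\nQuestion: What is the core idea of this doc?"},
    ],
)
print(response.choices[0].message.content)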
