Hacker News vector search dataset

Start...
Enter a search query: are OLAP cubes useful
Generating embeddings for "are OLAP cubes useful"
Querying ClickHouse...
Result:
27742647 smartmike: slt2021: The OLAP cube is not dead, as long as you use some form of it: 1. Group by Multiple Fi...
27744260 georgefraser: A data mart is a logical organization of data to help humans understand the schema. Wh...
27761434 mwexler: "We model data according to rigid frameworks like Kimball or Inmon because we must r...
28401230 chotmat: erosenbe0: OLAP database is just a copy, replica or collection of data with a schema design...
22198879 Merrick for Apache Kylin: +1, It's a great project and amazing open source community. If there is any...
27741776 crazydoggers: I always thought the value of an OLAP cube was uncovering questions you might not know about...
22189480 shadowson7: _codemonkeyism: After maintaining an OLAP cube system for a few years, I'm not that...
27742029 smartmik: gangstrand: My first introduction to OLAP was developing a front end for Essense. Was on a team that...
22364133 irfansharif: simo7: I'm wondering how this technique could work for OLAP cubes. An OLAP cube...
23292746 scoresmoke: When I was developing my favorite project for web analytics (Summary Demo Application)...

The example above demonstrates semantic search and document retrieval using ClickHouse.
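The ranking behind this search is driven by cosine distance between embedding vectors. For intuition only, here is a minimal pure-Python sketch of the metric that ClickHouse's cosineDistance function computes (1 minus the cosine similarity of the two vectors):

```python
import math

def cosine_distance(a, b):
    # cosineDistance(a, b) = 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0; orthogonal vectors have distance 1
print(round(cosine_distance([1.0, 0.0], [2.0, 0.0]), 6))  # 0.0
print(round(cosine_distance([1.0, 0.0], [0.0, 1.0]), 6))  # 1.0
```

Ordering rows by this distance (ascending) returns the posts whose embeddings point in the most similar direction to the query embedding, regardless of vector magnitude.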

Next, a simple but promising Generative AI example application is presented.

The application performs the following steps:

  1. Accepts a subject as input from the user
  2. Generates an embedding vector for the subject using SentenceTransformers with the all-MiniLM-L6-v2 model
  3. Retrieves highly relevant posts/comments using vector similarity search on the hackernews table
  4. Uses LangChain with the OpenAI gpt-3.5-turbo chat API to summarize the content retrieved in step 3; the posts/comments from step 3 are passed to the model as context

An example run of the summarization application is shown first, followed by the application's code. To run the application, an OpenAI API key must be set in the OPENAI_API_KEY environment variable. An API key can be obtained after registering at https://platform.openai.com.
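For example, on a Linux/macOS shell (the key value below is a placeholder, not a real key):

```shell
# Set the OpenAI API key for the current shell session (placeholder value)
export OPENAI_API_KEY="sk-..."

# Then launch the summarization application
python3 summarize.py
```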

This application demonstrates a generative AI use-case that is applicable to multiple enterprise domains such as: customer sentiment analysis, technical support automation, mining user conversations, legal documents, medical records, meeting transcripts, financial statements, etc.

$ python3 summarize.py

Enter a search topic :
ClickHouse performance experiences

Generating the embedding for ---->  ClickHouse performance experiences

Querying ClickHouse to retrieve relevant articles...

Initializing chatgpt-3.5-turbo model...

Summarizing search results retrieved from ClickHouse...

Summary from chatgpt-3.5:
The discussion focuses on comparing ClickHouse with various databases like TimescaleDB, Apache Spark,
AWS Redshift, and QuestDB, highlighting ClickHouse's cost-efficient high performance and suitability
for analytical applications. Users praise ClickHouse for its simplicity, speed, and resource efficiency
in handling large-scale analytics workloads, although some challenges like DMLs and difficulty in backups
are mentioned. ClickHouse is recognized for its real-time aggregate computation capabilities and solid
engineering, with comparisons made to other databases like Druid and MemSQL. Overall, ClickHouse is seen
as a powerful tool for real-time data processing, analytics, and handling large volumes of data
efficiently, gaining popularity for its impressive performance and cost-effectiveness.

Code for the above application:

print("Initializing...")

import sys
import json
import time
from sentence_transformers import SentenceTransformer

import clickhouse_connect

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
import textwrap
import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    # Count the tokens the given OpenAI model would see for this string
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client(compress=False) # ClickHouse credentials here

while True:
    # Take the search query from user
    print("Enter a search topic :")
    input_query = sys.stdin.readline().strip()
    texts = [input_query]

    # Run the model and obtain search or reference vector
    print("Generating the embedding for ----> ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1':list(embeddings[0]), 'v2':100}
    result = chclient.query("SELECT id,title,text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    # Just join all the search results
    doc_results = ""
    for row in result.result_rows:
        doc_results = doc_results + "\n" + row[2]

    print("Initializing chatgpt-3.5-turbo model")
    model_name = "gpt-3.5-turbo"

    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        model_name=model_name
    )

    texts = text_splitter.split_text(doc_results)

    docs = [Document(page_content=t) for t in texts]

    llm = ChatOpenAI(temperature=0, model_name=model_name)

    prompt_template = """
Write a concise summary of the following in not more than 10 sentences:


{text}


CONCISE SUMMARY:
"""

    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    num_tokens = num_tokens_from_string(doc_results, model_name)

    gpt_35_turbo_max_tokens = 4096
    verbose = False

    print("Summarizing search results retrieved from ClickHouse...")

    if num_tokens <= gpt_35_turbo_max_tokens:
        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
    else:
        chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt, verbose=verbose)

    summary = chain.run(docs)

    print(f"Summary from chatgpt-3.5: {summary}")


