Theory
Crawler + Index + Algorithm
indexer, searcher, stemmer, ranker
inverted index?
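A minimal sketch of what an inverted index is, to answer the question above: a map from each token to the set of documents containing it, so a query only touches the postings of its own tokens. The sample documents and the function names (`tokenize`, `build_inverted_index`, `lookup`) are made up for illustration and are not taken from any of the linked projects.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def lookup(index, query):
    """Return ids of documents that contain every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Crawlers fetch pages from the web",
    2: "The indexer builds an inverted index from pages",
    3: "The ranker orders results",
}
index = build_inverted_index(docs)
print(lookup(index, "pages"))
print(lookup(index, "inverted index"))
```

Storing sets of ids keeps the example short; a real index would usually also store term counts or positions so the ranker has something to score with.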
Building a personal search engine
Examples
https://github.com/thesephist/monocle?tab=readme-ov-file#monocle-
Data in the form of modules → tokenizer → indexer (inverted index)
Query → tokenizer → stemming expansion → search → rank (TF-IDF)
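The two pipelines noted for Monocle can be sketched roughly as follows. This is not Monocle's code: the suffix-stripping stemmer, the smoothed TF-IDF formula, and all function names and sample documents are assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Very rough stemmer (assumption): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Return ({token: {doc_id: count}}, {doc_id: token count})."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        lengths[doc_id] = sum(counts.values())
        for token, n in counts.items():
            index[token][doc_id] = n
    return index, lengths

def expand(index, token):
    """Stemming expansion: every indexed token sharing the query token's stem."""
    stem = crude_stem(token)
    return [t for t in index if crude_stem(t) == stem]

def search(index, lengths, query):
    """Rank documents by summed TF-IDF over the expanded query tokens."""
    n_docs = len(lengths)
    scores = defaultdict(float)
    for q in tokenize(query):
        for token in expand(index, q):
            postings = index[token]
            # Smoothed IDF (assumption) so terms present in every document still count.
            idf = math.log(1 + n_docs / len(postings))
            for doc_id, count in postings.items():
                scores[doc_id] += (count / lengths[doc_id]) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample documents.
docs = {
    "a": "notes on indexing and ranking search results",
    "b": "a crawler fetches pages before the indexer runs",
    "c": "ranked results come from the ranker",
}
index, lengths = build_index(docs)
for doc_id, score in search(index, lengths, "rank"):
    print(doc_id, round(score, 3))
```

Expanding at query time keeps the index itself unstemmed, which is one way to read "stemming expansion"; stemming at index time instead would make the index smaller but lose the original word forms.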
https://www.youtube.com/watch?v=PWTPSukXeIg
https://github.com/siddhantdubey/Sidgrep?tab=readme-ov-file
YouTube transcript data → OpenAI embeddings → Pinecone API for storage
Search → OpenAI embeddings → Pinecone API gives results
https://www.youtube.com/watch?v=UUnAcrzA0nA
https://github.com/thesephist/ycvibecheck/tree/main
https://marketbrew.ai/understanding-query-parsers-how-search-engines-process-your-searches
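The embedding flow noted for Sidgrep (text → embeddings → vector store, then query → embedding → nearest results) can be sketched offline. This does not call any embedding or vector database API: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` is a stand-in for a hosted vector store, and the transcript chunks are invented.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical vector size; real embedding models use far more dimensions

def toy_embed(text):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

class InMemoryVectorStore:
    """Stand-in for a vector database: brute-force cosine similarity search."""

    def __init__(self):
        self.items = []  # list of (item_id, vector)

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, top_k=3):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (item_id, sum(a * b for a, b in zip(vector, stored)))
            for item_id, stored in self.items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

# Hypothetical transcript chunks.
chunks = {
    "chunk-1": "how search engines rank pages with an inverted index",
    "chunk-2": "training a neural network on image data",
    "chunk-3": "cooking pasta with tomato sauce",
}
store = InMemoryVectorStore()
for chunk_id, text in chunks.items():
    store.add(chunk_id, toy_embed(text))

print(store.query(toy_embed("inverted index for search engines"), top_k=2))
```

With a real embedding model the vectors capture meaning rather than shared words, which is the point of the approach; the toy version only shows the shape of the store-then-query loop.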
Crawler
https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
https://jsoup.org/
https://github.com/yasserg/crawler4j
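A minimal sketch of the core crawl loop, using only the Python standard library rather than jsoup or crawler4j (which are Java). The function names, the `max_pages` limit, and the injectable `fetch` parameter are assumptions for illustration; a real crawler would also need robots.txt handling, politeness delays, and error reporting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_http(url):
    """Download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl(start_url, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl; returns {url: html} for the pages visited."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments so pages are not revisited.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Offline usage with a hypothetical three-page site instead of real HTTP.
site = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.test/a": '<a href="/">home</a> <a href="/gone">gone</a>',
    "https://example.test/b": "no links here",
}
pages = crawl("https://example.test/", fetch=lambda url: site[url])
print(list(pages))
```

Passing `fetch` in as a parameter makes the loop testable without network access; the returned pages are what would then be handed to the tokenizer and indexer.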
React and Next
https://react.dev/
https://nextjs.org/
https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/
https://jamescalam.medium.com/free-course-on-vector-similarity-search-and-faiss-9b3e91a91384
https://www.webfx.com/blog/internet/what-is-a-web-crawler/
https://www.youtube.com/watch?v=7RF03_WQJpQ&t=203s