Commit

Multiple changes for version 3.0
pablomarin committed Dec 30, 2024
1 parent 9a7df82 commit 980fafc
Showing 64 changed files with 5,868 additions and 4,820 deletions.
129 changes: 92 additions & 37 deletions 01-Load-Data-ACogSearch.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
"source": [
"# Introduction\n",
"\n",
"Welcome to this repository. We will walk you through a series of notebooks in which you will learn how RAG works (Retrieval Augmented Generation, a technique that combines the power of search and generative AI to answer user queries). We will work with different sources (Azure AI Search, Files, SQL Server, Websites, APIs, etc.), and by the end of the notebooks you will understand why the magic happens with the combination of:\n",
"Welcome to this repository. We will walk you through a series of notebooks in which you will learn how Agents and RAG work (Retrieval Augmented Generation, a technique that combines the power of search and generative AI to answer user queries). We will work with different sources (Azure AI Search, Files, SQL Server, Websites, APIs, etc.), and by the end of the notebooks you will understand why the magic happens with the combination of:\n",
"\n",
"1) Multi-Agents: Agents talking to each other\n",
"2) Azure OpenAI models\n",
Expand All @@ -26,7 +26,7 @@
"\n",
"In this demo we are going to be using a private Blob Storage container (so we can mimic a private data lake scenario) that holds all the dialogues from each episode of the TV series FRIENDS (3.1k text files).\n",
"\n",
"Although only TXT files are used here, this can be done at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including: Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
"Although only TXT files are used here, this can be done at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including: PDF, Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
"Azure AI Search supports the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)\n",
"\n",
"This notebook creates the following objects on your search service:\n",
Expand Down Expand Up @@ -103,6 +103,21 @@
"## Upload local dataset to Blob Container"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Define connection string and other parameters\n",
"BLOB_CONTAINER_NAME = \"friends\"\n",
"BLOB_NAME = \"friends_transcripts.zip\"\n",
"LOCAL_FILE_PATH = \"./data/\" + BLOB_NAME # Path to the local file you want to upload\n",
"upload_directory = \"./data/temp_extract\" # Temporary directory to extract the zip file"
]
},
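The constants above feed the upload step in the next cell: the zip is extracted to `upload_directory` and each extracted file is pushed to the container. As a minimal sketch of the traversal half of that work, the helper below pairs every local file with the blob name it would get (the `plan_uploads` function and its `prefix` parameter are hypothetical illustrations, not part of the notebook's utilities):

```python
import os

def plan_uploads(upload_directory: str, prefix: str = "") -> list[tuple[str, str]]:
    """Walk the extracted folder and pair each local file with a blob name,
    preserving the relative folder structure inside the container."""
    plan = []
    for root, _dirs, files in os.walk(upload_directory):
        for name in files:
            local_path = os.path.join(root, name)
            # Blob names always use forward slashes, regardless of OS
            blob_name = os.path.relpath(local_path, upload_directory).replace(os.sep, "/")
            plan.append((local_path, prefix + blob_name))
    return plan

# The actual upload (network call, sketched here only as comments) would use
# the azure-storage-blob SDK, roughly:
#   container = BlobServiceClient.from_connection_string(conn_str) \
#                   .get_container_client(BLOB_CONTAINER_NAME)
#   for local_path, blob_name in plan_uploads(upload_directory):
#       with open(local_path, "rb") as f:
#           container.upload_blob(name=blob_name, data=f, overwrite=True)
```

Separating the "plan" from the upload loop also makes it easy to show a progress bar (as the notebook does with `tqdm`) or to parallelize the uploads.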
{
"cell_type": "code",
"execution_count": 4,
Expand All @@ -122,28 +137,22 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [08:57<00:00, 5.78it/s]\n"
"Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [09:02<00:00, 5.72it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Temp Folder: ./data/temp_extract removed\n",
"CPU times: user 34 s, sys: 5.15 s, total: 39.2 s\n",
"Wall time: 11min 48s\n"
"CPU times: user 32.1 s, sys: 5.05 s, total: 37.1 s\n",
"Wall time: 11min 15s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# Define connection string and other parameters\n",
"BLOB_CONTAINER_NAME = \"friends\"\n",
"BLOB_NAME = \"friends_transcripts.zip\"\n",
"LOCAL_FILE_PATH = \"./data/\" + BLOB_NAME # Path to the local file you want to upload\n",
"upload_directory = \"./data/temp_extract\" # Temporary directory to extract the zip file\n",
"\n",
"# Extract the zip file\n",
"extract_zip_file(LOCAL_FILE_PATH, upload_directory)\n",
"\n",
Expand Down Expand Up @@ -211,7 +220,7 @@
"\n",
"For information on Change and Delete file detection, please see [HERE](https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs?tabs=rest-api)\n",
"\n",
"Also, if your data is one AWS or GCP, and you do not want to move it to Azure, you can create an Azure Fabric shortcut in OneLake and use Fabric as a data source here. From the documentation [HERE](https://learn.microsoft.com/en-us/azure/search/search-how-to-index-onelake-files):\n",
"Also, if your data is on AWS or GCP, and you do not want to move it to Azure, you can create an Azure Fabric shortcut in OneLake and use Fabric as a data source here. From the documentation [HERE](https://learn.microsoft.com/en-us/azure/search/search-how-to-index-onelake-files):\n",
"> If you use Microsoft Fabric and OneLake for data access to Amazon Web Services (AWS) and Google data sources, use this indexer to import external data into a search index. This indexer is available through the Azure portal, the 2024-05-01-preview REST API, and Azure SDK beta packages."
]
},
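The data source the indexer reads from is itself just a small REST payload. A minimal sketch of what that body can look like for a blob container, assuming the standard Azure AI Search blob data source shape (the `blob_datasource_payload` helper is a hypothetical illustration; check the REST reference for the exact fields in your api-version):

```python
def blob_datasource_payload(name: str, connection_string: str,
                            container_name: str) -> dict:
    """Body for PUT {search-endpoint}/datasources/{name}?api-version=...
    (sent with an 'api-key' header)."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        # To index only blobs under a folder, add "query": "some/prefix"
        # inside the container object.
        "container": {"name": container_name},
    }
```

A change-detection policy does not need to be declared for blob data sources: the blob `LastModified` timestamp is used automatically, which is why re-running the indexer only picks up new or changed files.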
Expand Down Expand Up @@ -254,9 +263,22 @@
"We are also setting up semantic ranking over the result set, which promotes the most semantically relevant results to the top of the stack. You can also get semantic captions, with highlights over the most relevant terms and phrases, as well as semantic answers."
]
},
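At query time, hybrid retrieval plus semantic ranking is just a matter of the request body you POST to the index's `/docs/search` endpoint. A rough sketch of such a body follows; the field name `chunkVector` matches the index defined below, while `my-semantic-config` is a placeholder for whatever name the semantic configuration is given (both are assumptions here, not values the notebook guarantees):

```python
def semantic_hybrid_query(text: str, k: int = 5,
                          semantic_config: str = "my-semantic-config") -> dict:
    """Request body combining keyword (BM25), vector, and semantic ranking."""
    return {
        "search": text,                       # keyword half of the hybrid query
        "queryType": "semantic",
        "semanticConfiguration": semantic_config,
        "captions": "extractive",             # highlighted semantic captions
        "answers": "extractive",              # direct semantic answers
        "vectorQueries": [{
            "kind": "text",                   # the index vectorizer embeds this
            "text": text,                     # text at query execution time
            "fields": "chunkVector",
            "k": k,
        }],
        "top": k,
    }
```

Because the index declares a vectorizer, the query can pass raw text in `vectorQueries` and let the service produce the embedding, instead of the client calling the embedding model itself.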
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A note about compression and space optimization**: \n",
"Creating a vector index requires optimization; otherwise it can get very large and very expensive very quickly. From the documentation:\n",
"> Embeddings, or the numerical representation of heterogeneous content, are the basis of vector search workloads, but the sizes of embeddings make them hard to scale and expensive to process. Significant research and productization have produced multiple solutions for improving scale and reducing processing times. Azure AI Search taps into a number of these capabilities for faster and cheaper vector workloads.\n",
"\n",
"\n",
"Below we will implement some of these compression techniques where it says `Compression (optional)`.\n",
"For detailed information about compression techniques, please check the documentation [HERE](https://learn.microsoft.com/en-us/azure/search/vector-search-index-size?tabs=portal-vector-quota)"
]
},
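Back-of-the-envelope arithmetic makes the savings concrete. Assuming (illustratively) one million vectors at 3,072 dimensions, scalar quantization to `int8` plus `truncationDimension: 1024` cuts raw vector storage by 12x, and binary quantization (one bit per dimension) by 96x; narrowing the stored type to `Edm.Half` adds a further factor on top. The numbers below are raw vector bytes only, ignoring graph and index overhead:

```python
def vector_store_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw storage for the vector data alone (HNSW graph overhead excluded)."""
    return num_vectors * dims * bits_per_dim / 8

full_precision = vector_store_bytes(1_000_000, 3072, 32)  # float32, no truncation
scalar_int8    = vector_store_bytes(1_000_000, 1024, 8)   # int8 + truncation to 1024
binary         = vector_store_bytes(1_000_000, 1024, 1)   # 1 bit per dimension

print(full_precision / scalar_int8)  # 12.0
print(full_precision / binary)       # 96.0
```

Rescoring with `preserveOriginals` keeps a copy of the full-precision vectors for the oversampled rescoring pass, so it trades some of these savings for better recall.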
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 9,
"metadata": {
"tags": []
},
Expand All @@ -278,13 +300,38 @@
" \"vectorSearch\": {\n",
" \"algorithms\": [\n",
" {\n",
" \"name\": \"myalgo\",\n",
" \"kind\": \"hnsw\"\n",
" \"name\": \"use-hnsw\",\n",
" \"kind\": \"hnsw\",\n",
" }\n",
" ],\n",
" \"compressions\": [ # Compression (optional)\n",
" {\n",
" \"name\": \"use-scalar\",\n",
" \"kind\": \"scalarQuantization\",\n",
" \"rescoringOptions\": {\n",
" \"enableRescoring\": \"true\",\n",
" \"defaultOversampling\": 10,\n",
" \"rescoreStorageMethod\": \"preserveOriginals\"\n",
" },\n",
" \"scalarQuantizationParameters\": {\n",
" \"quantizedDataType\": \"int8\"\n",
" },\n",
" \"truncationDimension\": 1024\n",
" },\n",
" {\n",
" \"name\": \"use-binary\",\n",
" \"kind\": \"binaryQuantization\",\n",
" \"rescoringOptions\": {\n",
" \"enableRescoring\": \"true\",\n",
" \"defaultOversampling\": 10,\n",
" \"rescoreStorageMethod\": \"preserveOriginals\"\n",
" },\n",
" \"truncationDimension\": 1024\n",
" }\n",
" ],\n",
" \"vectorizers\": [\n",
" \"vectorizers\": [ # converts text (or images) to vectors during query execution.\n",
" {\n",
" \"name\": \"openai\",\n",
" \"name\": \"use-openai\",\n",
" \"kind\": \"azureOpenAI\",\n",
" \"azureOpenAIParameters\":\n",
" {\n",
Expand All @@ -297,12 +344,19 @@
" }\n",
" ],\n",
" \"profiles\": [\n",
" {\n",
" \"name\": \"myprofile\",\n",
" \"algorithm\": \"myalgo\",\n",
" \"vectorizer\":\"openai\"\n",
" }\n",
" ]\n",
" {\n",
" \"name\": \"vector-profile-hnsw-scalar\",\n",
" \"compression\": \"use-scalar\", # Compression (optional)\n",
" \"algorithm\": \"use-hnsw\",\n",
" \"vectorizer\": \"use-openai\"\n",
" },\n",
" {\n",
" \"name\": \"vector-profile-hnsw-binary\",\n",
" \"compression\": \"use-binary\",\n",
" \"algorithm\": \"use-hnsw\",\n",
" \"vectorizer\": \"use-openai\"\n",
" }\n",
" ]\n",
" },\n",
" \"semantic\": {\n",
" \"configurations\": [\n",
Expand Down Expand Up @@ -331,14 +385,15 @@
" {\"name\": \"chunk\",\"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
" {\n",
" \"name\": \"chunkVector\",\n",
" \"type\": \"Collection(Edm.Single)\",\n",
" \"dimensions\": 1536, # IMPORTANT: Make sure these dimensions match your embedding model's output size\n",
" \"vectorSearchProfile\": \"myprofile\",\n",
" \"type\": \"Collection(Edm.Half)\", # Compression (optional)\n",
" \"dimensions\": 3072, # IMPORTANT: Make sure these dimensions match your embedding model's output size\n",
" \"vectorSearchProfile\": \"vector-profile-hnsw-scalar\",\n",
" \"searchable\": \"true\",\n",
" \"retrievable\": \"true\",\n",
" \"retrievable\": \"false\",\n",
" \"filterable\": \"false\",\n",
" \"sortable\": \"false\",\n",
" \"facetable\": \"false\"\n",
" \"facetable\": \"false\",\n",
" \"stored\": \"false\" # Compression (optional)\n",
" }\n",
" ]\n",
"}\n",
Expand All @@ -351,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 10,
"metadata": {
"tags": []
},
Expand Down Expand Up @@ -406,7 +461,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 11,
"metadata": {
"tags": []
},
Expand Down Expand Up @@ -561,7 +616,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 12,
"metadata": {
"tags": []
},
Expand All @@ -586,7 +641,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 13,
"metadata": {
"tags": []
},
Expand Down Expand Up @@ -643,7 +698,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 14,
"metadata": {
"tags": []
},
Expand All @@ -662,7 +717,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 18,
"metadata": {
"tags": []
},
Expand Down Expand Up @@ -731,9 +786,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10 - SDK v2",
"display_name": "GPTSearch2 (Python 3.12)",
"language": "python",
"name": "python310-sdkv2"
"name": "gptsearch2"
},
"language_info": {
"codemirror_mode": {
Expand All @@ -745,7 +800,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.12.8"
},
"vscode": {
"interpreter": {
Expand Down