Commit

Merge pull request #76 from pablomarin/main
Update all notebooks to GPT-4o and GPT-4o-mini and new datasets
pablomarin authored Oct 4, 2024
2 parents b40b2c2 + dd052c0 commit bd634ef
Showing 25 changed files with 31,541 additions and 2,277 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -8,7 +8,6 @@ common/__pycache__/
.streamlit/
*.amltmp
*.amltemp
data/
credentials.env
.azure/
.vscode/
139 changes: 109 additions & 30 deletions 01-Load-Data-ACogSearch.ipynb
@@ -24,13 +24,9 @@
"In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure Cognitive Search. \n",
"The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).\n",
"\n",
"In this demo we are going to be using a private (so we can mimic a private data lake scenario) Blob Storage container that has ~9.8k Computer Science publication PDFs from the Arxiv dataset.\n",
"https://www.kaggle.com/datasets/Cornell-University/arxiv\n",
"In this demo we use a private Blob Storage container (to mimic a private data lake scenario) that holds the dialogue of every episode of the TV series FRIENDS: about 3.1k text files.\n",
"\n",
"If you want to explore the dataset, go [HERE](https://console.cloud.google.com/storage/browser/arxiv-dataset/arxiv/cs/pdf?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false)<br>\n",
"Note: This dataset has been copy to a public azure blob container for this demo\n",
"\n",
"Although only PDF files are used here, this can be done at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
"Although only TXT files are used here, this can be done at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
"Azure Search supports the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)\n",
"\n",
"This notebook creates the following objects on your search service:\n",
@@ -57,23 +53,27 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import shutil\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"load_dotenv(\"credentials.env\")\n",
"\n",
"# Name of the container in your Blob Storage Datasource ( in credentials.env)\n",
"BLOB_CONTAINER_NAME = \"arxivcs\""
"from common.utils import upload_file_to_blob, extract_zip_file, upload_directory_to_blob\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Define the names for the data source, skillset, index and indexer\n",
@@ -86,7 +86,9 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Setup the Payloads header\n",
@@ -98,13 +100,74 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Data Source (Blob container with the Arxiv CS pdfs)"
"## Upload local dataset to Blob Container"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting ./data/friends_transcripts.zip ... \n",
"Extracted ./data/friends_transcripts.zip to ./data/temp_extract\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [08:47<00:00, 5.89it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Temp Folder: ./data/temp_extract removed\n",
"CPU times: user 34.9 s, sys: 5.76 s, total: 40.6 s\n",
"Wall time: 11min 21s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# Define the container name, zip file name, and local paths\n",
"BLOB_CONTAINER_NAME = \"friends\"\n",
"BLOB_NAME = \"friends_transcripts.zip\"\n",
"LOCAL_FILE_PATH = \"./data/\" + BLOB_NAME # Path to the local file you want to upload\n",
"upload_directory = \"./data/temp_extract\" # Temporary directory to extract the zip file\n",
"\n",
"# Extract the zip file\n",
"extract_zip_file(LOCAL_FILE_PATH, upload_directory)\n",
"\n",
"# Upload the extracted files and folder structure\n",
"upload_directory_to_blob(upload_directory, BLOB_CONTAINER_NAME)\n",
"\n",
"# Clean up: Optionally, you can remove the temp folder after uploading\n",
"shutil.rmtree(upload_directory)\n",
"print(f\"Temp Folder: {upload_directory} removed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Data Source (Blob container with the FRIENDS transcripts)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@@ -120,7 +183,7 @@
"\n",
"datasource_payload = {\n",
" \"name\": datasource_name,\n",
" \"description\": \"Demo files to demonstrate cognitive search capabilities.\",\n",
" \"description\": \"Demo files to demonstrate ai search capabilities.\",\n",
" \"type\": \"azureblob\",\n",
" \"credentials\": {\n",
" \"connectionString\": os.environ['BLOB_CONNECTION_STRING']\n",
@@ -154,8 +217,10 @@
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# If you get a 403 status code, the endpoint or key is probably wrong; uncomment the following to debug\n",
@@ -191,8 +256,10 @@
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@@ -284,8 +351,10 @@
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# r.text"
@@ -337,8 +406,10 @@
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@@ -490,8 +561,10 @@
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# print(r.text)"
@@ -513,8 +586,10 @@
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
@@ -569,7 +644,9 @@
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Uncomment if you find an error\n",
@@ -585,7 +662,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 20,
"metadata": {
"tags": []
},
@@ -596,7 +673,7 @@
"text": [
"200\n",
"Status: inProgress\n",
"Items Processed: 154\n",
"Items Processed: 2180\n",
"True\n"
]
}
@@ -620,7 +697,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**When the indexer finishes running we will have all 9.8k documents indexed in your Search Engine!.**"
"**When the indexer finishes running, all 994 documents will be indexed in your search engine.**\n",
"\n",
"**Note:** Notice that it indexes only 1 document (the zip file), but the AI Search service did the work of uncompressing it and indexing each individual document."
]
},
{
@@ -666,7 +745,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.14"
},
"vscode": {
"interpreter": {
Expand Down
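
The new upload cell calls `extract_zip_file` and `upload_directory_to_blob` from `common.utils`, but their bodies are not part of this diff. The following is only a minimal sketch of what an extract-then-walk step of that kind might look like, not the repository's implementation. The names `extract_zip` and `upload_directory`, and the `upload_one` callback, are hypothetical; the actual upload is injected as a callable so the sketch runs without an Azure account.

```python
import zipfile
from pathlib import Path
from typing import Callable, List


def extract_zip(zip_path: str, extract_dir: str) -> None:
    """Extract every member of zip_path into extract_dir (created if missing)."""
    Path(extract_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)


def upload_directory(local_dir: str,
                     upload_one: Callable[[Path, str], None]) -> List[str]:
    """Walk local_dir and pass each file to upload_one(local_path, blob_name).

    blob_name is the file's path relative to local_dir, written with forward
    slashes, so the folder structure is preserved as blob name prefixes.
    upload_one is a hypothetical hook; this function does no network I/O itself.
    Returns the blob names in sorted order.
    """
    root = Path(local_dir)
    blob_names = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        blob_name = path.relative_to(root).as_posix()
        upload_one(path, blob_name)
        blob_names.append(blob_name)
    return blob_names
```

In a real run, `upload_one` could wrap a call such as `ContainerClient.upload_blob` from the `azure-storage-blob` package, opening `local_path` in binary mode and passing `blob_name` as the blob name; that wiring is an assumption here and is left out so the walk logic can be checked locally.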