diff --git a/Advanced/regex/regex_tutorial_exercise_answer.ipynb b/Advanced/regex/regex_tutorial_exercise_answer.ipynb new file mode 100644 index 00000000..0e2c9951 --- /dev/null +++ b/Advanced/regex/regex_tutorial_exercise_answer.ipynb @@ -0,0 +1,154 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**1. Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters and underscores, i.e. A-Z, a-z, 0-9 and _**" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information \n", + "on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers \n", + "for tesla related news,\n", + "https://twitter.com/teslarati\n", + "https://twitter.com/dummy_tesla\n", + "https://twitter.com/dummy_2_tesla\n", + "'''\n", + "pattern = 'https://twitter\\.com/([a-zA-Z0-9_]+)'\n", + "\n", + "re.findall(pattern, text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**2. Extract Concentration Risk Types. 
It is the text that appears after \"Concentration of Risk:\". In the example below, your regex should extract these two strings**\n", + "\n", + "(1) Credit Risk\n", + "\n", + "(2) Supply Risk" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Credit Risk', 'Supply Risk']" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Concentration of Risk: Credit Risk\n", + "Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,\n", + "restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds\n", + "or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021\n", + "and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note\n", + "hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.\n", + "Concentration of Risk: Supply Risk\n", + "We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our\n", + "products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these\n", + "suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.\n", + "'''\n", + "pattern = 'Concentration of Risk: ([^\\n]*)'\n", + "\n", + "re.findall(pattern, text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**3. 
Companies in Europe report their financial numbers on a semi-annual basis, and you can have a document like this. To extract quarterly and semi-annual periods you can use a regex as shown below**\n", + "\n", + "Hint: use a non-capturing group (?:...) for the Q/S alternation so that re.findall returns only the outer captured group" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['2021 Q1', '2021 S1']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.\n", + "BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.\n", + "'''\n", + "\n", + "pattern = 'FY(\\d{4} (?:Q[1-4]|S[1-2]))'\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Advanced/regex/regex_tutorial_exercise_questions.ipynb b/Advanced/regex/regex_tutorial_exercise_questions.ipynb new file mode 100644 index 00000000..e82b52c8 --- /dev/null +++ b/Advanced/regex/regex_tutorial_exercise_questions.ipynb @@ -0,0 +1,135 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Python Regular Expression Tutorial Exercise

" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**1. Extract all twitter handles from following text. Twitter handle is the text that appears after https://twitter.com/ and is a single word. Also it contains only alpha numeric characters i.e. A-Z a-z , o to 9 and underscore _**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "text = '''\n", + "Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information \n", + "on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers \n", + "for tesla related news,\n", + "https://twitter.com/teslarati\n", + "https://twitter.com/dummy_tesla\n", + "https://twitter.com/dummy_2_tesla\n", + "'''\n", + "pattern = '' # todo: type your regex here\n", + "\n", + "re.findall(pattern, text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**2. Extract Concentration Risk Types. It will be a text that appears after \"Concentration Risk:\", In below example, your regex should extract these two strings**\n", + "\n", + "(1) Credit Risk\n", + "\n", + "(2) Supply Rish" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = '''\n", + "Concentration of Risk: Credit Risk\n", + "Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,\n", + "restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds\n", + "or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. 
As of September 30, 2021\n", + "and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note\n", + "hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.\n", + "Concentration of Risk: Supply Risk\n", + "We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our\n", + "products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these\n", + "suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.\n", + "'''\n", + "pattern = '' # todo: type your regex here\n", + "\n", + "re.findall(pattern, text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**3. Companies in Europe report their financial numbers on a semi-annual basis, and you can have a document like this. 
To extract quarterly and semi-annual periods you can use a regex as shown below**\n", + "\n", + "Hint: use a non-capturing group (?:...) for the Q/S alternation so that re.findall returns only the outer captured group" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = '''\n", + "Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.\n", + "BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.\n", + "'''\n", + "\n", + "pattern = '' # todo: type your regex here\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__[Solution](http://ndtv.com)__" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Advanced/regex/regex_tutorial_python.ipynb b/Advanced/regex/regex_tutorial_python.ipynb new file mode 100644 index 00000000..df1cdbb0 --- /dev/null +++ b/Advanced/regex/regex_tutorial_python.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Extract phone numbers

" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['9991116666', '(999)-333-7777']" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text='''\n", + "Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. Tesla's revenue is 40 billion\n", + "Tesla's CFO number (999)-333-7777\n", + "'''\n", + "pattern = '\\(\\d{3}\\)-\\d{3}-\\d{4}|\\d{10}'\n", + "\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Extract Note Titles

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [], + "source": [ + "text = '''\n", + "Note 1 - Overview\n", + "Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage\n", + "products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.\n", + "Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines\n", + "against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection\n", + "rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges\n", + "and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor\n", + "supply. 
We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to\n", + "administrative activities supporting our product deliveries and deployments.\n", + "Note 2 - Summary of Significant Accounting Policies\n", + "Unaudited Interim Financial Statements\n", + "The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of\n", + "comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September\n", + "30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information\n", + "disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited\n", + "consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in\n", + "conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year\n", + "ended December 31, 2020.\n", + "'''" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Overview', 'Summary of Significant Accounting Policies']" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pattern = 'Note \\d - ([^\\n]*)'\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Extract financial periods from a company's financial reporting

" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['FY2021 Q1', 'FY2020 Q4']" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.\n", + "In previous quarter i.e. FY2020 Q4 it was $3 billion. \n", + "'''\n", + "\n", + "pattern = 'FY\\d{4} Q[1-4]'\n", + "\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Case insensitive pattern match using flags**" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['FY2021 Q1', 'fy2020 Q4']" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.\n", + "In previous quarter i.e. fy2020 Q4 it was $3 billion. \n", + "'''\n", + "\n", + "pattern = 'FY\\d{4} Q[1-4]'\n", + "\n", + "matches = re.findall(pattern, text, flags=re.IGNORECASE)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Extract only financial numbers

" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['4.85', '3']" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. \n", + "In previous quarter i.e. FY2020 Q4 it was $3 billion.\n", + "'''\n", + "\n", + "pattern = '\\$([0-9\\.]+)'\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Extract both periods and financial numbers

" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[('2021 Q1', '4.85'), ('2020 Q4', '3')]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. \n", + "In previous quarter i.e. FY2020 Q4 it was $3 billion.\n", + "'''\n", + "pattern = 'FY(\\d{4} Q[1-4])[^\\$]+\\$([0-9\\.]+)'\n", + "\n", + "matches = re.findall(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

re.search

" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text = '''\n", + "Tesla's gross cost of operating lease vehicles in FY2021 Q1 ljh lsj a 123 was $4.85 billion. Same number for FY2020 Q4 was $8 billion\n", + "'''\n", + "pattern = 'FY(\\d{4} Q[1-4])[^\\$]+\\$([0-9\\.]+)'\n", + "\n", + "matches = re.search(pattern, text)\n", + "matches" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('2021 Q1', '4.85')" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "matches.groups()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Advanced/regex/tesla_report_notes.jpg b/Advanced/regex/tesla_report_notes.jpg new file mode 100644 index 00000000..3d22a02b Binary files /dev/null and b/Advanced/regex/tesla_report_notes.jpg differ