Listing and downloading files from the Drive #1
Open: joellensilva wants to merge 4 commits into main from criando-coletor
+350
−1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
credentials.json | ||
venv | ||
__pycache__ | ||
lista_planilhas_baixadas.csv | ||
.env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# set base image (host OS) | ||
FROM python:3.8-slim-buster | ||
|
||
# set the working directory in the container | ||
WORKDIR /code | ||
|
||
# copy the dependencies file to the working directory | ||
COPY requirements.txt . | ||
|
||
# install dependencies | ||
RUN pip install -r requirements.txt | ||
|
||
# copy the content of the local directory to the working directory | ||
COPY src/ . | ||
|
||
# copy the service account credentials to the working directory | ||
COPY credentials.json . | ||
|
||
# command to run on container start | ||
CMD [ "python", "./main.py" ] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,38 @@ | ||
# coletor-mp-manuais | ||
# Coletor MPs Manuais | ||
|
||
This collector has two functions: | ||
|
||
1. Create and update the listing of files in the Drive (spreadsheets of paychecks and indemnities collected manually) | ||
2. Download the files for the reference agency/year/month within the collection pipeline (in which the data is processed, standardized, and stored). | ||
|
||
## Environment variables and required files | ||
|
||
Environment variables can be passed on the command line itself, e.g. `FILE_ID={} python3...`, or through a .env file. | ||
|
||
- `credentials.json`: .json file containing the service account credentials. | ||
- `DATA_FOLDER_ID`: ID of the Drive folder that holds the data folders of each agency. | ||
- `FILE_ID`: ID of the file in the Drive (lista_planilhas_baixadas.csv). The file is updated and is also used to look up files during the collection pipeline. | ||
|
||
## Before anything else... | ||
|
||
Create a virtual environment and install the required packages: | ||
|
||
```{sh} | ||
python3 -m venv venv | ||
source venv/bin/activate | ||
pip3 install -r requirements.txt | ||
``` | ||
|
||
## How to update the file list | ||
|
||
```{sh} | ||
python3 src/list_drive_files.py | ||
``` | ||
|
||
## How to download the files | ||
|
||
We pass `COURT`, `MONTH`, `YEAR`, and `OUTPUT_FOLDER` as parameters, i.e. the agency, the reference month and year, and the directory in which the files will be stored, respectively. | ||
|
||
```{sh} | ||
COURT=MPPA MONTH=06 YEAR=2022 OUTPUT_FOLDER=. python3 src/main.py | ||
``` |
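A minimal sketch of how the parameters described in the README could be read in one place. It is not part of this PR: `CollectorConfig`, `require`, `load_config`, and `MissingParameterError` are hypothetical names. It assumes the same variables (`COURT`, `MONTH`, `YEAR`, `OUTPUT_FOLDER`, `FILE_ID`) and the same `./output` default used by `src/main.py`. Unlike the scripts, it also treats an empty value as missing.

```python
import os
from dataclasses import dataclass
from typing import Mapping


class MissingParameterError(ValueError):
    """Raised when a required environment variable is absent or empty."""


@dataclass(frozen=True)
class CollectorConfig:
    court: str
    month: int
    year: int
    output_folder: str
    file_id: str


def require(env: Mapping[str, str], name: str) -> str:
    """Return env[name], or raise with the message format the scripts print."""
    value = env.get(name)
    if not value:
        raise MissingParameterError(
            f"Invalid arguments, missing parameter: '{name}'."
        )
    return value


def load_config(env: Mapping[str, str] = os.environ) -> CollectorConfig:
    """Build the configuration from a mapping (os.environ by default)."""
    return CollectorConfig(
        court=require(env, "COURT").casefold(),
        month=int(require(env, "MONTH")),  # "06" becomes 6
        year=int(require(env, "YEAR")),
        output_folder=env.get("OUTPUT_FOLDER", "./output"),
        file_id=require(env, "FILE_ID"),
    )
```

Taking the mapping as an argument keeps the function easy to exercise with a plain dictionary, e.g. `load_config({"COURT": "MPPA", "MONTH": "06", "YEAR": "2022", "FILE_ID": "some-id"})`.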
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
google-auth==2.35.0 | ||
google-auth-oauthlib==1.2.1 | ||
google-auth-httplib2==0.2.0 | ||
google-api-python-client==2.146.0 | ||
pandas>=2.0.3 | ||
python-dotenv>=0.20.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
from google.oauth2 import service_account | ||
from googleapiclient.discovery import build | ||
from googleapiclient.http import MediaIoBaseDownload | ||
import io | ||
import pandas as pd | ||
import pathlib | ||
import sys | ||
|
||
STATUS_DATA_UNAVAILABLE = 4 | ||
|
||
# Path to the service account JSON file | ||
SERVICE_ACCOUNT_FILE = "credentials.json" | ||
|
||
# Scopes required to access Google Drive | ||
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"] | ||
|
||
# Authenticate using the service account credentials | ||
creds = service_account.Credentials.from_service_account_file( | ||
    SERVICE_ACCOUNT_FILE, scopes=SCOPES | ||
) | ||
|
||
|
||
def download_list(file_id): | ||
    # Connect to the Google Drive API | ||
    service = build("drive", "v3", credentials=creds) | ||
|
||
    file_name = "lista_planilhas_baixadas.csv" | ||
|
||
    request = service.files().get_media(fileId=file_id, supportsAllDrives=True) | ||
|
||
    # Download in chunks; the with block closes the local file when done | ||
    with io.FileIO(file_name, "wb") as fh: | ||
        downloader = MediaIoBaseDownload(fh, request) | ||
|
||
        done = False | ||
        while not done: | ||
            status, done = downloader.next_chunk() | ||
|
||
|
||
def consult_list(orgao, mes, ano): | ||
    sheets_list = pd.read_csv("lista_planilhas_baixadas.csv") | ||
    filter_list = sheets_list[ | ||
        (sheets_list.orgao == orgao) | ||
        & (sheets_list.mes == mes) | ||
        & (sheets_list.ano == ano) | ||
    ] | ||
|
||
    # If there are no files for the agency/month/year, exit with status 4 | ||
    if filter_list.empty: | ||
        sys.stderr.write( | ||
            f"No spreadsheets found for {orgao}/{mes}/{ano}.\n" | ||
        ) | ||
        sys.exit(STATUS_DATA_UNAVAILABLE) | ||
|
||
    return filter_list | ||
|
||
|
||
def download_files(output_path, filter_list): | ||
    # Take the date and time at which the first file for this agency/month/year was stored | ||
    timestamp = filter_list.data.min() | ||
    ts_files = [timestamp] | ||
|
||
    # Create the output directory (and any missing parents) if it does not exist | ||
    pathlib.Path(output_path).mkdir(parents=True, exist_ok=True) | ||
|
||
    # Connect to the Google Drive API | ||
    service = build("drive", "v3", credentials=creds) | ||
|
||
    for row in filter_list.itertuples(index=False): | ||
        # File ID in the Drive (column "id_arquivo") | ||
        file_id = row.id_arquivo | ||
|
||
        # Local path under which the file is saved (column "arquivo") | ||
        file_name = output_path + "/" + row.arquivo | ||
|
||
        ts_files.append(file_name) | ||
|
||
        # Request the file from the Google Drive API | ||
        request = service.files().get_media(fileId=file_id, supportsAllDrives=True) | ||
|
||
        # Download in chunks; the with block closes the local file when done | ||
        with io.FileIO(file_name, "wb") as fh: | ||
            downloader = MediaIoBaseDownload(fh, request) | ||
|
||
            done = False | ||
            while not done: | ||
                status, done = downloader.next_chunk() | ||
|
||
    return ts_files |
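To make the lookup in `consult_list` easier to follow, here is a dependency-free sketch of the same filter. It swaps pandas for the standard `csv` module, so it is an illustration and not the code in this PR. `consult_rows`, `first_stored`, and the sample rows (file names, IDs, timestamps) are hypothetical. It assumes the header written by `create_csv` and that the `data` column holds timestamps in one uniform ISO 8601 format, which sort correctly as plain strings.

```python
import csv
import io
from typing import Dict, List, TextIO


def consult_rows(csv_file: TextIO, orgao: str, mes: int, ano: int) -> List[Dict[str, str]]:
    """Return the rows of the listing that match agency, month and year."""
    matches = []
    for row in csv.DictReader(csv_file):
        # csv yields strings, so month and year are converted before comparing:
        # "06" in the file must match the integer 6 passed by the caller.
        if (
            row["orgao"] == orgao
            and int(row["mes"]) == mes
            and int(row["ano"]) == ano
        ):
            matches.append(row)
    return matches


def first_stored(rows: List[Dict[str, str]]) -> str:
    """Earliest value of the "data" column (assumes uniform ISO 8601 strings)."""
    return min(row["data"] for row in rows)


# Hypothetical listing content, for illustration only.
SAMPLE_LISTING = (
    "orgao,mes,ano,arquivo,id_arquivo,data\n"
    "mppa,06,2022,MPPA-contracheques-06-2022.csv,id-1,2024-01-02T10:00:00.000Z\n"
    "mppa,06,2022,MPPA-indenizacoes-06-2022.csv,id-2,2024-01-01T09:00:00.000Z\n"
    "mpba,06,2022,MPBA-contracheques-06-2022.csv,id-3,2024-01-03T08:00:00.000Z\n"
)

if __name__ == "__main__":
    rows = consult_rows(io.StringIO(SAMPLE_LISTING), "mppa", 6, 2022)
    print(first_stored(rows), [row["arquivo"] for row in rows])
```

An empty result corresponds to the case in which the collector exits with `STATUS_DATA_UNAVAILABLE` (4).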
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,147 @@ | ||
from google.oauth2 import service_account | ||
from googleapiclient.discovery import build | ||
from googleapiclient.http import MediaFileUpload | ||
import os | ||
import csv | ||
import sys | ||
from dotenv import load_dotenv | ||
|
||
load_dotenv() | ||
|
||
|
||
# ID of the folder that contains all the spreadsheets to be listed | ||
if "DATA_FOLDER_ID" in os.environ: | ||
    DATA_FOLDER_ID = os.environ["DATA_FOLDER_ID"] | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'DATA_FOLDER_ID'.\n") | ||
    sys.exit(1) | ||
|
||
# ID of the list that will be updated in the Drive | ||
if "FILE_ID" in os.environ: | ||
    FILE_ID = os.environ["FILE_ID"] | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'FILE_ID'.\n") | ||
    sys.exit(1) | ||
|
||
# Path to the service account JSON file | ||
SERVICE_ACCOUNT_FILE = "credentials.json" | ||
|
||
# Scopes required to access Google Drive | ||
SCOPES = ["https://www.googleapis.com/auth/drive"] | ||
|
||
# Authenticate using the service account credentials | ||
creds = service_account.Credentials.from_service_account_file( | ||
    SERVICE_ACCOUNT_FILE, scopes=SCOPES | ||
) | ||
|
||
# Connect to the Google Drive API | ||
service = build("drive", "v3", credentials=creds) | ||
|
||
|
||
def list_folders(): | ||
    folders = [] | ||
    page_token = None | ||
|
||
    # List the IDs of the folders (one per agency) inside the given folder | ||
    while True: | ||
        results = ( | ||
            service.files() | ||
            .list( | ||
                q=f"'{DATA_FOLDER_ID}' in parents", | ||
                pageSize=100, | ||
                fields="nextPageToken, files(id)", | ||
                supportsAllDrives=True, | ||
                includeItemsFromAllDrives=True, | ||
                pageToken=page_token, | ||
            ) | ||
            .execute() | ||
        ) | ||
|
||
        folders.extend(results.get("files", [])) | ||
        page_token = results.get("nextPageToken", None) | ||
|
||
        if not page_token: | ||
            break | ||
|
||
    if not folders: | ||
        print("Folder not found.") | ||
        # sys.exit flushes buffered output before exiting; os._exit does not | ||
        sys.exit(1) | ||
    else: | ||
        return folders | ||
|
||
|
||
def list_files(folders): | ||
    files = [] | ||
    for folder in folders: | ||
        page_token = None | ||
|
||
        while True: | ||
            results = ( | ||
                service.files() | ||
                .list( | ||
                    q=f"'{folder['id']}' in parents", | ||
                    pageSize=100, | ||
                    fields="nextPageToken, files(id, name, createdTime)", | ||
                    supportsAllDrives=True, | ||
                    includeItemsFromAllDrives=True, | ||
                    pageToken=page_token, | ||
                ) | ||
                .execute() | ||
            ) | ||
|
||
            files.extend(results.get("files", [])) | ||
            page_token = results.get("nextPageToken", None) | ||
|
||
            if not page_token: | ||
                break | ||
|
||
    return files | ||
|
||
|
||
def create_csv(files): | ||
    list_path = "lista_planilhas_baixadas.csv" | ||
    with open(list_path, mode="w", newline="", encoding="utf-8") as csv_list: | ||
        csv_writer = csv.writer(csv_list) | ||
|
||
        # Write the header | ||
        csv_writer.writerow(["orgao", "mes", "ano", "arquivo", "id_arquivo", "data"]) | ||
|
||
        for file in files: | ||
            # @old is the name of the folder created to hold "old" spreadsheets, | ||
            # i.e. ones that were downloaded but were broken/wrong and have since been replaced | ||
            if file['name'] != '@old': | ||
                # Remove the extension | ||
                filename = os.path.splitext(file["name"])[0] | ||
|
||
                # Split the string on the '-' delimiter | ||
                parts = filename.split("-") | ||
|
||
                orgao = parts[0].lower() | ||
                mes = parts[2] | ||
                ano = parts[3] | ||
|
||
                csv_writer.writerow( | ||
                    [orgao, mes, ano, file["name"], file["id"], file["createdTime"]] | ||
                ) | ||
|
||
    return list_path | ||
|
||
|
||
def upload_list(list_path): | ||
    # Store the CSV in the Drive | ||
    file_name = os.path.basename(list_path) | ||
|
||
    # Upload the file | ||
    media = MediaFileUpload(list_path, mimetype="text/csv") | ||
    file = ( | ||
        service.files() | ||
        .update(fileId=FILE_ID, media_body=media, supportsAllDrives=True) | ||
        .execute() | ||
    ) | ||
|
||
|
||
if __name__ == "__main__": | ||
    folders = list_folders() | ||
    files = list_files(folders) | ||
    list_path = create_csv(files) | ||
    upload_list(list_path) |
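`create_csv` reads `parts[0]`, `parts[2]`, and `parts[3]` after splitting the file name on `-`, so a name with fewer than four parts would raise `IndexError`. The sketch below shows one way to validate names before writing a row. It is an assumption-laden illustration, not the PR's code: `SheetInfo` and `parse_sheet_name` are hypothetical names, the sample file names are invented, and requiring numeric month and year fields is a stricter rule than the original applies.

```python
import os
from typing import NamedTuple, Optional


class SheetInfo(NamedTuple):
    orgao: str
    mes: str
    ano: str


def parse_sheet_name(name: str) -> Optional[SheetInfo]:
    """Split a name on '-' the way create_csv does, or return None if it does not fit."""
    stem = os.path.splitext(name)[0]  # drop the extension
    parts = stem.split("-")

    # create_csv uses indexes 0, 2 and 3, so at least four parts are needed.
    if len(parts) < 4:
        return None

    mes, ano = parts[2], parts[3]
    # Assumption: month and year are written with digits only.
    if not (mes.isdigit() and ano.isdigit()):
        return None

    return SheetInfo(orgao=parts[0].lower(), mes=mes, ano=ano)


if __name__ == "__main__":
    for candidate in ("MPPA-contracheques-06-2022.csv", "@old", "notes.txt"):
        print(candidate, parse_sheet_name(candidate))
```

Returning `None` lets the caller decide whether to skip the entry or report it. A name such as `@old` is rejected by the length check without a special case.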
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
import sys | ||
import os | ||
import crawler | ||
|
||
|
||
if "COURT" in os.environ: | ||
    court = os.environ["COURT"].casefold() | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'COURT'.\n") | ||
    sys.exit(1) | ||
|
||
if "YEAR" in os.environ: | ||
    year = int(os.environ["YEAR"]) | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'YEAR'.\n") | ||
    sys.exit(1) | ||
|
||
if "MONTH" in os.environ: | ||
    month = int(os.environ["MONTH"]) | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'MONTH'.\n") | ||
    sys.exit(1) | ||
|
||
if "OUTPUT_FOLDER" in os.environ: | ||
    output_path = os.environ["OUTPUT_FOLDER"] | ||
else: | ||
    output_path = "./output" | ||
|
||
# ID of the list in the Drive that tracks the manually downloaded spreadsheets | ||
if "FILE_ID" in os.environ: | ||
    file_id = os.environ["FILE_ID"] | ||
else: | ||
    sys.stderr.write("Invalid arguments, missing parameter: 'FILE_ID'.\n") | ||
    sys.exit(1) | ||
|
||
# Download the list of files | ||
crawler.download_list(file_id) | ||
|
||
# Check whether the files exist | ||
result = crawler.consult_list(court, month, year) | ||
|
||
# Download the files | ||
stdout = crawler.download_files(output_path, result) | ||
|
||
# Print the timestamp followed by the paths of the files | ||
print('\n'.join(stdout)) |
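Since `main.py` prints the timestamp on the first line and one file path per following line, a later pipeline stage could read that output along these lines. This is a sketch only: `CollectorOutput` and `parse_collector_output` are hypothetical names and the sample text is invented.

```python
from typing import List, NamedTuple


class CollectorOutput(NamedTuple):
    timestamp: str
    files: List[str]


def parse_collector_output(stdout: str) -> CollectorOutput:
    """Split the collector's stdout into the timestamp and the file paths."""
    # Ignore blank lines such as the trailing newline added by print().
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("the collector printed nothing")
    return CollectorOutput(timestamp=lines[0], files=lines[1:])


if __name__ == "__main__":
    # Hypothetical output, for illustration only.
    sample = (
        "2024-01-01T09:00:00.000Z\n"
        "./MPPA-contracheques-06-2022.csv\n"
        "./MPPA-indenizacoes-06-2022.csv\n"
    )
    print(parse_collector_output(sample))
```

A caller would also need to check the process exit status first, because the collector exits with status 4 and prints nothing to stdout when no spreadsheets exist for the requested agency/month/year.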
Where will this file live?
I ask because, instead of a local file, we can turn it into a secret passed through an environment variable. If the API requires it to be a file, we should write the variable's content to a file.
This avoids having credential files stored on local machines.
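A rough sketch of the suggestion above, assuming the secret is the full JSON of the service account passed in a variable named `GOOGLE_CREDENTIALS_JSON`. That name, `credentials_info_from_env`, and `persist_credentials` are hypothetical and not part of this PR. The google-auth library can also build credentials directly from the parsed dictionary with `service_account.Credentials.from_service_account_info`, in which case the file step is unnecessary.

```python
import json
import os
import tempfile
from typing import Mapping


def credentials_info_from_env(
    env: Mapping[str, str], name: str = "GOOGLE_CREDENTIALS_JSON"
) -> dict:
    """Parse the service account JSON stored in an environment variable."""
    raw = env.get(name)
    if not raw:
        raise KeyError(f"missing environment variable: {name}")
    info = json.loads(raw)
    if not isinstance(info, dict):
        raise ValueError(f"{name} must contain a JSON object")
    return info


def persist_credentials(info: dict, directory: str) -> str:
    """Write the credentials to credentials.json inside directory and return the path.

    On POSIX systems a newly created file gets owner-only permissions (0o600).
    """
    path = os.path.join(directory, "credentials.json")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(info, handle)
    return path


if __name__ == "__main__":
    # Hypothetical secret content, for illustration only.
    fake_env = {"GOOGLE_CREDENTIALS_JSON": '{"type": "service_account"}'}
    info = credentials_info_from_env(fake_env)
    with tempfile.TemporaryDirectory() as workdir:
        print(persist_credentials(info, workdir))
```

Writing to a temporary directory keeps the file out of the repository and the image. Whether that fits the pipeline depends on where the collector runs.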
I talked to @joellensilva and she suggested creating an authorization/token on GitHub itself. I can't say which is the best solution for using these credentials, but I stored the files `credentials.json` and `.env` in the TB vault in case a backup is needed; this data will be kept safe and can be recovered at any time.