Ncas image v1.1 #38

Open

wants to merge 111 commits into base: main

Commits (111)

Changes from all commits
fb87ce3
First commit of image reader
joshua-hampton May 17, 2023
4ea4456
this is a test
Aug 7, 2023
493ebb2
Merge pull request #31 from cedadev/main
joshua-hampton Aug 7, 2023
e9ed8dd
_attrs_dict splits the line once at the first :
Aug 7, 2023
f6b688a
Add image reader to parse_file_header
Aug 7, 2023
0415305
Undo test comment
Aug 7, 2023
f2ffc5c
New file for NCAS image
Aug 7, 2023
ad9acf4
New file for NCAS image global attrs
Aug 7, 2023
b7026c7
#24 moving to use global attributes file
Aug 7, 2023
ea79ab9
#23 rename from image to photo
Aug 7, 2023
c3a7d96
#24 new plot file
Aug 7, 2023
2783d2d
#24 new image checks
Aug 8, 2023
0aa4d13
#24 removed plot/photo specific checks
Aug 8, 2023
72009e9
#24 tidying up after chat with Graham and Davey
Aug 8, 2023
aaef23e
#24 new rules for NCAS image
Aug 8, 2023
b2b2f32
# 24 import image reader
Aug 8, 2023
28e09a2
#24 import image reader
Aug 8, 2023
1b982ca
Merge branch 'ncas-image' of https://github.com/cedadev/checksit into…
Aug 8, 2023
fe3a8e4
#24 adapt to key/value format
Aug 9, 2023
8de3455
#24 import datetime
Aug 9, 2023
f33da82
#24 regex edits
Aug 9, 2023
2a77e88
#24 regex brackets correction
Aug 9, 2023
3f20290
#24 regex brackets correction
Aug 9, 2023
a661571
#24 swapping -args for -j due to newline/.
Aug 10, 2023
19acb9c
#24 specifying relation uuid length
Aug 10, 2023
e4c42dd
changing folder name
Aug 10, 2023
f8852d7
#24 adding warnings code
Aug 10, 2023
4ffac7a
#24 tidying up
Aug 10, 2023
a94fb16
#24 warnings for vocab_attrs
Aug 10, 2023
0cf0e53
#24 name warning
Aug 10, 2023
83b01ba
#24 relation warning
Aug 21, 2023
1d456ea
#24 rights warning
Aug 21, 2023
835b052
#24 WebStatement warning
Aug 21, 2023
ca064ee
#24 credit warning
Aug 21, 2023
22d12bc
#24 location must have at least one comma & space
Aug 21, 2023
68e4730
#24 location warning
Aug 21, 2023
edbbf1e
#24 Headline warning function
Aug 21, 2023
067c668
#24 check url exists
Aug 21, 2023
6671191
#24 url check ContributerIdentifier
Aug 21, 2023
7a58a2e
#24 WebStatement valid URL check
Aug 22, 2023
c4fa3d8
#24 relation url valid check
Aug 22, 2023
e4022a6
#24 change url valid checks to warnings
Aug 22, 2023
a38e7f7
#24 space optional in name regex
Aug 22, 2023
d75f6e5
#24 change list of names to a warning
Shanrahan16 Aug 22, 2023
bc010a7
#24 remove WebStatement valid URL check - regex ok
Shanrahan16 Aug 22, 2023
47a70c3
#24 title_check
Shanrahan16 Aug 22, 2023
73dd4f4
#24 compare Title to actual file name
Shanrahan16 Aug 23, 2023
2b61a16
#24 tidying up
Shanrahan16 Aug 23, 2023
d3f6f2f
#24 latitude/longitude range checks
Shanrahan16 Aug 23, 2023
ee27e80
#24 adding possiblility of - within lat/long regex
Shanrahan16 Aug 23, 2023
f9ae14f
#24 tidying up
Shanrahan16 Aug 24, 2023
bf2846d
#24 allow muliple checks for each metadata key
Shanrahan16 Aug 24, 2023
00b9519
#24 changing the yaml file to allow multiple
Shanrahan16 Aug 24, 2023
fb904ee
#24 tidying up
Shanrahan16 Aug 24, 2023
a982333
#24 correcting headline capital letter check
Shanrahan16 Aug 24, 2023
5af1750
#24 fixing title check
Shanrahan16 Aug 24, 2023
f8d32f3
#24 tidying up
Shanrahan16 Aug 24, 2023
deb379e
#24 test images
Shanrahan16 Aug 25, 2023
8e54b9c
#24 title data product warning
Shanrahan16 Aug 30, 2023
0182c3a
#24 error/warning wording
Shanrahan16 Aug 31, 2023
ff96edd
#24 new test images
Shanrahan16 Aug 31, 2023
324df5c
#24 test images
Shanrahan16 Aug 31, 2023
3d83f7e
#35 data_product in title must be 'plot'/'photo'
Shanrahan16 Aug 31, 2023
f7cc17a
#35 instrument controlled vocab check
Shanrahan16 Sep 1, 2023
6f98d3f
#35 platform controlled vocab check
Shanrahan16 Sep 1, 2023
71f1937
#35 adding community instruments to vocab check
Shanrahan16 Sep 1, 2023
622fc13
#24 updating for errors & warnings being returned
Shanrahan16 Sep 4, 2023
d94b392
#24 tidying up
Shanrahan16 Sep 4, 2023
1cec480
#24 resolves list index - warnings for vocab_attrs
Shanrahan16 Sep 5, 2023
986e5d7
#24 addresses key error from inpt
Shanrahan16 Sep 5, 2023
49f92bf
#24 reducing traceback output when filepath wrong
Shanrahan16 Sep 5, 2023
14cd3da
#24 adding requests to requirements
Shanrahan16 Sep 6, 2023
911588f
#24 allow decimal altitudes-warning if not integer
Shanrahan16 Sep 6, 2023
54c3221
#24 location- allowing digits, hyphens, accents...
Shanrahan16 Sep 6, 2023
932903d
#24 ncas email warning
Shanrahan16 Sep 6, 2023
eb21897
#24 valid email error
Shanrahan16 Sep 6, 2023
bf0ef8a
#24 allow apostrophes, hyphens & accents in names
Shanrahan16 Sep 6, 2023
c3edfae
#24 all characters for names
Shanrahan16 Sep 6, 2023
036a815
#24 allow special characters in title
Shanrahan16 Sep 6, 2023
06671da
#24 combining if statements
Shanrahan16 Sep 7, 2023
9f4f201
#24 reorder so functn won't error out if <32 char
Shanrahan16 Sep 7, 2023
48462fe
#24 raise exception if no space in relation
Shanrahan16 Sep 7, 2023
156eafe
#24 changing Python error to a checksit error
Shanrahan16 Sep 7, 2023
f54cbec
#24 tidying up
Shanrahan16 Sep 7, 2023
bbc3f4f
#24 name characters warning
Shanrahan16 Sep 7, 2023
e3f1d3b
#24 renaming name-format regex
Shanrahan16 Sep 7, 2023
1470757
#24 missing comma
Shanrahan16 Sep 7, 2023
ca803c5
#24 name characters separate from format
Shanrahan16 Sep 7, 2023
d6ce055
#24 allow any metadata tags in image reader
Shanrahan16 Sep 7, 2023
4eb3a04
#24 removing tags dictionary as no longer used
Shanrahan16 Sep 7, 2023
4fe1e59
#24 test images
Shanrahan16 Sep 8, 2023
ae016de
#24 fixing relation url error output
Shanrahan16 Sep 8, 2023
c93b664
#24 tidying up
Shanrahan16 Sep 8, 2023
6373fdd
#24 more test images
Shanrahan16 Sep 8, 2023
d771550
#24 stopping empty warning being returned- url
Shanrahan16 Sep 8, 2023
0e8d1b5
#24 tidying up & stop url redirecting
Shanrahan16 Sep 8, 2023
cffa97d
#35 data_product in title must be 'plot'/'photo'
Shanrahan16 Aug 31, 2023
610c1de
rebasing
Shanrahan16 Sep 1, 2023
fba5eb3
#35 platform controlled vocab check
Shanrahan16 Sep 1, 2023
88137d1
#35 adding community instruments to vocab check
Shanrahan16 Sep 1, 2023
943ff0b
Merge branch ncas-image-v1.1 into ncas-image-v1.1
Shanrahan16 Sep 8, 2023
bf7d79a
#35 tidying up
Shanrahan16 Sep 8, 2023
58a9e45
#24 removing url requests.get()
Shanrahan16 Sep 21, 2023
5fe4f4d
#24 tidying up
Shanrahan16 Sep 21, 2023
e4b090d
#35 data_product in title must be 'plot'/'photo'
Shanrahan16 Aug 31, 2023
b052708
rebasing
Shanrahan16 Sep 1, 2023
204c661
#35 platform controlled vocab check
Shanrahan16 Sep 1, 2023
643de89
#35 adding community instruments to vocab check
Shanrahan16 Sep 1, 2023
dab2ad4
merging
Shanrahan16 Sep 1, 2023
4a04ec4
#35 tidying up
Shanrahan16 Sep 8, 2023
bb94386
Merge branch 'ncas-image-v1.1' of https://github.com/cedadev/checksit…
Shanrahan16 Sep 21, 2023
4 changes: 3 additions & 1 deletion checksit/check.py
@@ -10,7 +10,7 @@

from .cvs import vocabs, vocabs_prefix
from .rules import rules, rules_prefix
from .readers import pp, badc_csv, cdl, yml
from .readers import pp, badc_csv, cdl, yml, image
from .specs import SpecificationChecker
from .utils import get_file_base, extension, UNDEFINED
from .config import get_config
@@ -385,6 +385,8 @@ def parse_file_header(self, file_path, auto_cache=False, verbose=False):
            reader = badc_csv
        elif ext in ("yml"):
            reader = yml
        elif ext in ("png", "PNG", "jpg", "JPG", "jpeg", "JPEG"):
            reader = image
        else:
            raise Exception(f"No known reader for file with extension: {ext}")

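For orientation, the reader selection added to parse_file_header can be reproduced in isolation roughly as follows. This is a minimal standalone sketch; the helper name and the example filename are invented for illustration and are not part of the PR.

# Standalone sketch of the extension-based dispatch added above
# (hypothetical helper and filename; the real logic lives in parse_file_header in checksit/check.py).
def select_reader_name(file_path: str) -> str:
    ext = file_path.rsplit(".", 1)[-1]
    if ext in ("png", "PNG", "jpg", "JPG", "jpeg", "JPEG"):
        return "image"
    elif ext == "yml":
        return "yml"
    else:
        raise Exception(f"No known reader for file with extension: {ext}")

print(select_reader_name("example_photo.jpg"))  # prints "image"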
9 changes: 7 additions & 2 deletions checksit/generic.py
@@ -94,7 +94,10 @@ def check_global_attrs(dct, defined_attrs=None, vocab_attrs=None, regex_attrs=No
            errors.append(f"[global-attributes:**************:{attr}]: No value defined for attribute '{attr}'.")
        else:
            errors.extend(vocabs.check(vocab_attrs[attr], dct["global_attributes"].get(attr), label=f"[global-attributes:******:{attr}]***"))

            #vocab_check_output = vocabs.check(vocab_attrs[attr], dct["global_attributes"].get(attr), label=f"[global-attributes:******:{attr}]***")
            #warnings.extend(vocab_check_output[1])
            #errors.extend(vocab_check_output[0])

    for attr in regex_attrs:
        if attr not in dct['global_attributes']:
            errors.append(
@@ -118,7 +121,9 @@ def check_global_attrs(dct, defined_attrs=None, vocab_attrs=No
        elif is_undefined(dct['global_attributes'].get(attr)):
            errors.append(f"[global-attributes:**************:{attr}]: No value defined for attribute '{attr}'.")
        else:
            errors.extend(rules.check(rules_attrs[attr], dct['global_attributes'].get(attr), label=f"[global-attributes:******:{attr}]***"))
            rules_check_output = rules.check(rules_attrs[attr], dct['global_attributes'].get(attr), context=dct['inpt'], label=f"[global-attributes:******:{attr}]***")
            warnings.extend(rules_check_output[1])
            errors.extend(rules_check_output[0])


    return errors, warnings
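The change above assumes that rules.check now returns a pair of lists, (errors, warnings), rather than a single list of errors. A minimal sketch of that consumption pattern, using a stand-in rule function rather than the real checksit.rules module:

# Stand-in for rules.check, illustrating the (errors, warnings) return shape
# that check_global_attrs now unpacks; this is not the actual checksit implementation.
def fake_rules_check(rule, value, context=None, label=""):
    errors, warnings = [], []
    if value is None:
        errors.append(f"{label} no value supplied")
    elif len(str(value)) < 3:
        warnings.append(f"{label} '{value}' looks unusually short")
    return errors, warnings

errors, warnings = [], []
rules_check_output = fake_rules_check("rule-name:example", "ab", context="example.jpg", label="[global-attributes:******:Title]***")
warnings.extend(rules_check_output[1])
errors.extend(rules_check_output[0])
print(errors, warnings)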
7 changes: 5 additions & 2 deletions checksit/readers/cdl.py
@@ -2,6 +2,7 @@
import re
import yaml
import subprocess as sp
import sys

from ..cvs import vocabs, vocabs_prefix

@@ -40,7 +41,8 @@ def _parse(self, inpt):

        for s in self.CDL_SPLITTERS:
            if s not in cdl_lines:
                raise Exception(f"Invalid file or CDL contents provided: '{inpt[:100]}...'")
                print(f"Please check your command - invalid file or CDL contents provided: '{inpt[:100]}...'")
                sys.exit(1)

        sections = self._get_sections(cdl_lines, split_patterns=self.CDL_SPLITTERS, start_at=1)

@@ -188,7 +190,8 @@ def to_yaml(self):
    def to_dict(self):
        return {"dimensions": self.dimensions,
                "variables": self.variables,
                "global_attributes": self.global_attrs}
                "global_attributes": self.global_attrs,
                "inpt": self.inpt}


def read(fpath, verbose=False):
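The extra "inpt" key added to to_dict means downstream rule functions, such as title_check in rule_funcs.py below, can receive the original file path as context. A rough sketch of the shape of the returned dictionary, with invented values:

import os

# Keys mirror the to_dict() output in cdl.py; the values here are made up for illustration.
parsed = {
    "dimensions": {},
    "variables": {},
    "global_attributes": {"title": "example.nc"},
    "inpt": "/path/to/example.nc",
}
# A title_check-style comparison of a title against the file name taken from "inpt".
print(parsed["global_attributes"]["title"] == os.path.basename(parsed["inpt"]))  # True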
60 changes: 60 additions & 0 deletions checksit/readers/image.py
@@ -0,0 +1,60 @@
import subprocess as sp
import yaml

def get_output(cmd):
    subp = sp.Popen(cmd, shell=True, stdout=sp.PIPE, stderr=sp.PIPE)
    return subp.stdout.read().decode("charmap"), subp.stderr.read().decode("charmap")


class ImageParser:

    def __init__(self, inpt, verbose=False):
        self.inpt = inpt
        self.verbose = verbose
        self.base_exiftool_arguments = ["exiftool", "-G1", "-j", "-c", "%+.6f"]
        self._find_exiftool()
        self._parse(inpt)

    def _parse(self, inpt):
        if self.verbose: print(f"[INFO] Parsing input: {inpt[:100]}...")
        self.global_attrs = {}
        exiftool_arguments = self.base_exiftool_arguments + [inpt]
        exiftool_return_string = sp.check_output(exiftool_arguments)
        raw_global_attrs = yaml.load(exiftool_return_string, Loader=yaml.SafeLoader)[0]
        for tag_name in raw_global_attrs.keys():
            value_type = type(raw_global_attrs[tag_name])
            if value_type == list:
                self.global_attrs[tag_name] = str(raw_global_attrs[tag_name][0])
            else:
                self.global_attrs[tag_name] = str(raw_global_attrs[tag_name])

    def _find_exiftool(self):
        if self.verbose: print("[INFO] Searching for exiftool...")
        which_output, which_error = get_output("which exiftool")
        if which_error.startswith("which: no exiftool in"):
            msg = (
                f"'exiftool' required to read image file metadata but cannot be found.\n"
                f" Visit https://exiftool.org/ for information on 'exiftool'."
            )
            raise RuntimeError(msg)
        else:
            self.exiftool_location = which_output.strip()
            if self.verbose: print(f"[INFO] Found exiftool at {self.exiftool_location}.")

    def _attrs_dict(self, content_lines):
        attr_dict = {}
        for line in content_lines:
            if self.verbose: print(f"WORKING ON LINE: {line}")
            key_0 = line.split("=", 1)[0].strip()
            key = key_0[1:]  # removes first character - unwanted quotation marks
            value = line.split("=", 1)[1].strip()
            attr_dict[key] = value
        return attr_dict

    def to_dict(self):
        return {"global_attributes": self.global_attrs, "inpt": self.inpt}


def read(fpath, verbose=False):
    return ImageParser(fpath, verbose=verbose)
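A possible way to exercise the new reader end to end, assuming exiftool is installed and on PATH; the filename below is a placeholder, not a file shipped with the PR.

# Usage sketch for checksit.readers.image (requires exiftool; "example_photo.jpg" is illustrative).
from checksit.readers import image

parsed = image.read("example_photo.jpg", verbose=True).to_dict()
print(sorted(parsed))  # ['global_attributes', 'inpt']
for tag, value in parsed["global_attributes"].items():
    print(f"{tag}: {value}")  # every tag value is stored as a string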

221 changes: 221 additions & 0 deletions checksit/rules/rule_funcs.py
@@ -1,8 +1,16 @@
import os
import re
from datetime import datetime
import requests
import json
import pandas as pd
from urllib.request import urlopen
import json
import pandas as pd

from . import processors
from ..config import get_config
from pandas import json_normalize

conf = get_config()
rule_splitter = conf["settings"].get("rule_splitter", "|")
@@ -82,3 +90,216 @@ def string_of_length(value, context, extras=None, label=""):
        errors.append(f"{label} '{value}' must be exactly {min_length} characters")

    return errors


def validate_image_date_time(value, context, extras=None, label=""):
    """
    A function to identify if a date-time value is compatible with the NCAS image standard
    """
    errors = []

    try:
        if value != datetime.strptime(value, "%Y:%m:%d %H:%M:%S").strftime("%Y:%m:%d %H:%M:%S") and value != datetime.strptime(value, "%Y:%m:%d %H:%M:%S.%f").strftime("%Y:%m:%d %H:%M:%S.%f"):
            errors.append(f"{label} '{value}' needs to be of the format YYYY:MM:DD hh:mm:ss or YYYY:MM:DD hh:mm:ss.s")
    except ValueError:
        errors.append(f"{label} '{value}' needs to be of the format YYYY:MM:DD hh:mm:ss or YYYY:MM:DD hh:mm:ss.s")

    return errors


def validate_orcid_ID(value, context, extras=None, label=""):
    """
    A function to verify the format of an orcid ID
    """
    orcid_string = "https://orcid.org/" # required format of start of the string

    errors = []

    PI_orcid_digits = value[-19:]
    PI_orcid_digits_only = PI_orcid_digits.replace("-", "")

    # Check that the total length is correct
    if len(value) != 37:
        errors.append(f"{label} '{value}' needs to be of the format https://orcid.org/XXXX-XXXX-XXXX-XXXX")

    # Check the start of the string (first 18 characters)
    elif (value[0:18] != orcid_string or

          # Check that the "-" are in the correct places
          value[22] != "-" or
          value[27] != "-" or
          value[32] != "-" or

          # Check that the last characters contain only "-" and digits
          not PI_orcid_digits_only.isdigit()):

        errors.append(f"{label} '{value}' needs to be of the format https://orcid.org/XXXX-XXXX-XXXX-XXXX")

    return errors


def list_of_names(value, context, extras=None, label=""):
    """
    A function to verify the names of people when a list of names may be provided
    """
    name_pattern = r'(.)+, (.)+ ?((.)+|((.)\.))' # The format names should be written in
    character_name_pattern = r'[A-Za-z_À-ÿ\-\'\ \.\,]+'

    warnings = []

    if type(value) == list:
        for i in value:
            if not re.fullmatch(name_pattern, i):
                warnings.append(f"{label} '{value}' should be of the format <last name>, <first name> <middle initial(s)> or <last name>, <first name> <middle name(s)> where appropriate")
            if not re.fullmatch(character_name_pattern, i):
                warnings.append(f"{label} '{value}' - please use characters A-Z, a-z, À-ÿ where appropriate")

    if type(value) == str:
        if not re.fullmatch(name_pattern, value):
            warnings.append(f"{label} '{value}' should be of the format <last name>, <first name> <middle initial(s)> or <last name>, <first name> <middle name(s)> where appropriate")
        if not re.fullmatch(character_name_pattern, value):
            warnings.append(f"{label} '{value}' - please use characters A-Z, a-z, À-ÿ where appropriate")

    return warnings


def headline(value, context, extras=None, label=""):
    """
    A function to verify the format of the Headline
    """
    warnings = []

    if len(value) > 150:
        warnings.append(f"{label} '{value}' should contain no more than one sentence")

    if value.count(".") >= 2:
        warnings.append(f"{label} '{value}' should contain no more than one sentence")

    if not value[0].isupper():
        warnings.append(f"{label} '{value}' should start with a capital letter")

    if len(value) < 10:
        warnings.append(f"{label} '{value}' should be at least 10 characters")

    return warnings


def title_check(value, context, extras=None, label=""):
    """
    A function to check if the title matches the system filename
    """
    errors = []

    if value != os.path.basename(context):
        errors.append(f"{label} '{value}' must match the name of the file")

    return errors


def title_instrument(value, context, extras=None, label=""):
    """
    A function to check if the instrument in the title is contained in the controlled vocabulary lists
    """
    warnings = []

    instrument = value.partition("_")[0]

    # open JSON controlled vocab files:
    n = open('./checksit/vocabs/AMF_CVs/2.0.0/AMF_ncas_instrument.json', "r")
    c = open('./checksit/vocabs/AMF_CVs/2.0.0/AMF_community_instrument.json', "r")

    ## Reading from file:
    ncas_data = json.loads(n.read())
    community_data = json.loads(c.read())

    if instrument not in ncas_data['ncas_instrument'] and instrument not in community_data['community_instrument']:
        warnings.append(f"{label} '{instrument}' should be contained in one of the instrument controlled vocabulary lists")

    # Closing files
    n.close()
    c.close()

    return warnings

def title_platform(value, context, extras=None, label=""):
    """
    A function to check if the platform in the title is contained in the controlled vocabulary list
    """
    warnings = []

    platform = value.split("_")[1]

    # open JSON controlled vocab file:
    g = open('./checksit/vocabs/AMF_CVs/2.0.0/AMF_platform.json', "r")

    ## Reading from file:
    data = json.loads(g.read())

    if platform not in data['platform']:
        warnings.append(f"{label} '{platform}' should be contained in the platform controlled vocabulary list")

    # Closing file
    g.close()

    return warnings

def url_checker(value, context, extras=None, label=""):
    """
    A function to check if the url exists
    """
    warnings = []

    try: url = urlopen(value)
    except:
        warnings.append(f"{label} '{value}' is not a reachable url")
    else:
        if url.getcode() != 200: # (200 means it exists and is up and reachable)
            warnings.append(f"{label} '{value}' is not a reachable url")
    finally:
        return warnings


def relation_url_checker(value, context, extras=None, label=""):
    """
    A function to check if Relation field is in the correct format, and that the url exists
    """
    errors = []

    if " " not in value:
        errors.append(f"{label} '{value}' should contain a space before the url")
    else:
        relation_url = value.partition(" ")[2] # extract only the url part of the relation string
        if url_checker(relation_url, context, extras, label) != []:
            errors.append(url_checker(relation_url, context, extras, label)) # check the url exists using the url_checker() function defined above

    return errors


def latitude(value, context, extras=None, label=""):
    """
    A function to check if the latitude is within -90 and +90
    """
    errors = []

    latitude = re.findall(r'[0-9]+', value)[0]
    int_latitude = int(latitude)

    if int_latitude > 90:
        errors.append(f"{label} '{value}' must be within -90 and +90 ")

    return errors


def longitude(value, context, extras=None, label=""):
    """
    A function to check if the longitude is within -180 and +180
    """
    errors = []

    longitude = re.findall(r'[0-9]+', value)[0]
    int_longitude = int(longitude)

    if int_longitude > 180:
        errors.append(f"{label} '{value}' must be within -180 and +180 ")

    return errors
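All of the new rule functions follow the same convention of returning a list of error or warning strings, empty when the value passes. A quick illustration using the latitude check; the values and label are made up:

# Illustration of the error-list convention used by the new rule functions.
from checksit.rules.rule_funcs import latitude

label = "[global-attributes:******:GPSLatitude]***"
print(latitude("+91.000000", None, label=label))  # one error: value outside -90 to +90
print(latitude("+53.501000", None, label=label))  # []  (value passes)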