Reader

The present file documents the Reader module.


Module Structure

reader/
       ├─── java/parser/
       |                ├── extraction/
       |                |              └── ExtractorJava.java      ; Java interface for this module's API
       |                └─────── generator.utils/
       |                               ├── POSTagEnum.java         ; Enum for the possible POS tag values
       |                               └── SpecificationJava.java  ; Factory class that creates Scala-made Specification objects
       └── scala/parser/
                        ├── extraction/
                        |              ├── Extractor.scala         ; Handles the PDF parsing and JSON generation
                        |              └── FileHandler.scala       ; Handles the file inputs
                        └─────── generator.utils/
                                       ├── ImageProcessing.scala   ; Processes images and extracts their text
                                       ├── OpenNLP.scala           ; Handles the NLP (natural language processing) functionalities
                                       ├── Specification.scala     ; Classes that help specify the keywords sent when extracting information
                                       └── SpellChecker.scala      ; Handles the spellchecking operations to improve the OCR's accuracy

Main Features

This module of Flipper is dedicated to parsing a PDF document and returning a JSON object with useful information extracted from it. To achieve this, the module implements the following features:

  • Extracting the text content of a PDF document - Using Apache PDFBox, Flipper extracts all of the text inside the PDF document.

  • Searching values for keywords in a text - To extract the right information for the JSON object, the user passes in the keywords they want to find values for. For example, given only a "name" keyword, Flipper does its best to find a value (in this case a name) for that keyword and returns {"name" : <name that was found>}.

  • Extracting text from images - Using Tess4J and Scrimage, Flipper also extracts text from images by applying OCR (optical character recognition), to maximize the chance of extracting useful information.

The next section shows how to use these features in your own project.


Main Methods / Examples

This section shows examples (in both Scala and Java) of the features above, along with short API documentation for this module to help you find what you need.

  • Extracting text from the PDF doc and its images

To extract the text from a PDF document using Flipper simply pass the document path to readPDF (found in Extractor.scala or ExtractorJava.java).

Scala

    import parser.extraction.Extractor._
    import java.io.File
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
         }
    }

You now have the extracted text, wrapped in an Option[String] (or a plain String if you are using the Java interface), which prevents nulls in case the file does not exist.
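
If you would like Option-like behaviour on the Java side as well, one possibility is to wrap the call in java.util.Optional. The sketch below is only an illustration: SafeReadSketch and readIfPresent are hypothetical names that are not part of Flipper, and the reader argument stands in for a call such as ex::readPDF.

```java
import java.io.File;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of Flipper's API.
class SafeReadSketch {
    // Returns Optional.empty() when the file is missing or the reader yields null,
    // otherwise the text produced by the reader.
    static Optional<String> readIfPresent(File file, Function<File, String> reader) {
        if (file == null || !file.isFile()) {
            return Optional.empty();
        }
        return Optional.ofNullable(reader.apply(file));
    }
}
```

A caller could then write something like readIfPresent(file, ex::readPDF).ifPresent(text -> ...), assuming readPDF does not declare checked exceptions.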

  • Parsing PDF and returning a List of JSON Objects

The most straightforward way to use this module's API is to call getJSONObjects. You need to supply this method with the text you want to extract data from and a Map of keywords for which you want to obtain values.

The keywords Map associates each keyword with a Specification, which has three cases:

  • MultipleOf - the user supplies a list of options, and the result is the list of those options that were found in the text.

  • OneOf - the user also supplies a list of options, but the result is a single option from that list.

  • POSTag - used with the natural language processor (Apache OpenNLP) that Flipper relies on to improve the odds of finding a useful value for a given keyword. The POS tag tells Flipper what kind of value you want to obtain for the keyword. The possible POSTags are listed below:

Possible POSTags
Adjective
ProperNoun
Noun
PluralNoun
Verb
VerbPastParticiple
VerbGerund
Number
Adverb

You can also pass a Boolean to these Specification classes to indicate whether the values found for a particular keyword should be multiple values (a list) or just a single value. For example, Number(true) asks for a list of numbers.
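
To give an idea of what a POS tag selects: OpenNLP's English models label tokens with Penn Treebank tags, and the names listed above line up naturally with those codes. The sketch below assumes that pairing. PosTagSketch and PosFilter are hypothetical names, not Flipper's POSTagEnum, and the tokens and tags are passed in by hand instead of coming from OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Flipper's POS tag names. The Penn Treebank codes
// are standard; pairing them with Flipper's names is an assumption.
enum PosTagSketch {
    ADJECTIVE("JJ"),
    PROPER_NOUN("NNP"),
    NOUN("NN"),
    PLURAL_NOUN("NNS"),
    VERB("VB"),
    VERB_PAST_PARTICIPLE("VBN"),
    VERB_GERUND("VBG"),
    NUMBER("CD"),
    ADVERB("RB");

    final String pennTag;

    PosTagSketch(String pennTag) {
        this.pennTag = pennTag;
    }
}

class PosFilter {
    // Keeps the tokens whose tag (at the same index in `tags`) equals the wanted tag.
    static List<String> tokensWithTag(String[] tokens, String[] tags, PosTagSketch wanted) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length && i < tags.length; i++) {
            if (tags[i].equals(wanted.pennTag)) {
                matches.add(tokens[i]);
            }
        }
        return matches;
    }
}
```

With tokens {"John", "is", "21"} tagged {"NNP", "VBZ", "CD"}, asking for NUMBER keeps only "21", which is the kind of candidate a "age" -> Number() entry is looking for.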


You can now implement the following snippet:

Scala

    import parser.extraction.Extractor.{readPDF, getJSONObjects}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number}
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)
    val keywords = Map("name"-> ProperNoun(), "age" -> Number(), "phone" -> Number(true))
    
    val jsonObjs : List[String] = getJSONObjects(extractedText, keywords)
    
    //jsonObjs -> List(
    //                "{ "name" : "John Doe" , "age" : 21, "phone" : [01234, 56789] }",
    //                "{ "name" : "Jane Doe" , "age" : 22, "phone" : [40312, 95867] }"
    //                )

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.POSTagEnum;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.List;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            HashMap keywords = new HashMap<>();
            keywords.put("name", SpecificationJava.postTag(POSTagEnum.PROPERNOUN));
            keywords.put("age", SpecificationJava.postTag(POSTagEnum.NUMBER));
            keywords.put("phone", SpecificationJava.postTag(POSTagEnum.NUMBER, true));
            
            List jsonObjs = ex.getJSONObjects(extractedText, keywords);
            
            //jsonObjs -> List(
            //                "{ "name" : "John Doe" , "age" : 21, "phone" : [01234, 56789] }",
            //                "{ "name" : "Jane Doe" , "age" : 22, "phone" : [40312, 95867] }"
            //                )
         }
    }

You can also pass an optional flag to getJSONObjects specifying how the JSON should be output when a keyword has no value:

Possibilities:

  • "empty" (default) - { "name" : "John Doe", "age": "" }
  • "null" - { "name" : "John Doe", "age": null }
  • "remove" - { "name" : "John Doe" }
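
The effect of the three flag values can be pictured with a small stand-alone renderer. This is a sketch of the behaviour described above, not Flipper's implementation: MissingValueSketch and render are hypothetical names, a null map value stands for "no value found", and no JSON escaping is done.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration of the "empty" / "null" / "remove" flag.
class MissingValueSketch {
    // Renders string fields as a JSON object; a null value means nothing was found.
    // Values are not escaped, so this is only suitable for simple demo data.
    static String render(Map<String, String> fields, String flag) {
        StringJoiner json = new StringJoiner(", ", "{ ", " }");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String key = "\"" + field.getKey() + "\" : ";
            if (field.getValue() != null) {
                json.add(key + "\"" + field.getValue() + "\"");
            } else if ("null".equals(flag)) {
                json.add(key + "null");
            } else if ("remove".equals(flag)) {
                continue; // the keyword is left out of the object entirely
            } else {
                json.add(key + "\"\""); // "empty" is treated as the default
            }
        }
        return json.toString();
    }
}
```

"remove" keeps the output compact, while "empty" and "null" keep every object on the same set of keys, which tends to be easier for consumers that expect a fixed schema.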

getJSONObjects returns a List of JSON objects in the form of Strings. If you want a single JSON String instead, getSingleJSON returns all the values found for each defined keyword in one JSON String.

Scala

    import parser.extraction.Extractor.{readPDF, getSingleJSON}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number}
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)
    val keywords = Map("name"-> ProperNoun(), "age" -> Number())
    
    val jsonObj : String = getSingleJSON(extractedText, keywords)
    
    //jsonObj -> {"name" : ["John Doe", "Jane Doe"], "age" : [21, 25]}

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.POSTagEnum;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.List;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            HashMap keywords = new HashMap<>();
            keywords.put("name", SpecificationJava.postTag(POSTagEnum.PROPERNOUN));
            keywords.put("age", SpecificationJava.postTag(POSTagEnum.NUMBER));
            
            
            String jsonObj = ex.getSingleJSON(extractedText, keywords);
            
            //jsonObj -> {"name" : ["John Doe", "Jane Doe"], "age" : [21, 25]}
         }
    }
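
The difference between the two output shapes can be sketched without Flipper at all: a list of per-person objects on one side, one object with a list per keyword on the other. The code below merges the first shape into the second. MergeObjectsSketch is a hypothetical helper, and whether Flipper builds its single JSON this way is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "many objects" versus "one object with lists".
class MergeObjectsSketch {
    // Collects, per keyword, the values of all objects in encounter order.
    static Map<String, List<String>> merge(List<Map<String, String>> objects) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> object : objects) {
            for (Map.Entry<String, String> field : object.entrySet()) {
                merged.computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                      .add(field.getValue());
            }
        }
        return merged;
    }
}
```

Note that the merged form no longer records which name belongs to which age, so the list form is the better choice when that association matters.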

Flipper also provides a more in-depth API, covered further below, in case you want a Map of keywords and the values found for them instead of a JSON object.

  • Specifying values to find in the text

In some cases you might want to restrict the values to look for in the text. To find exactly one of a set of possible values, or several of them, use the OneOf / MultipleOf classes (in Java, use the SpecificationJava factory class, which has methods for creating these Scala classes). You only have to pass a List containing the possible values you want to find, like so:

Scala

    import parser.extraction.Extractor.{readPDF, getJSONObjects}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number, OneOf, MultipleOf}
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)
    val oneOfList = List("single","married","divorced")
    val multiList = List("java","scala","c","php","sql")
    val keywords = Map("civil status"-> OneOf(oneOfList), "skills" -> MultipleOf(multiList))
    
    val jsonObjs : List[String] = getJSONObjects(extractedText, keywords)
    
    //jsonObjs -> List(
    //                "{ "civil status" : "married" , "skills" : [java, c] }",
    //                "{ "civil status" : "single" , "skills" : [scala, php, sql] }"
    //                )

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.List;
    import java.util.ArrayList;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            ArrayList oneOfList = new ArrayList();
            ArrayList multiList = new ArrayList();
            
            oneOfList.add("single");
            oneOfList.add("married");
            oneOfList.add("divorced");
            
            multiList.add("java");
            multiList.add("scala");
            multiList.add("c");
            multiList.add("php");
            multiList.add("sql");
                        
            
            HashMap keywords = new HashMap<>();
            keywords.put("civil status", SpecificationJava.oneOf(oneOfList));
            keywords.put("skills", SpecificationJava.multipleOf(multiList));
            
            List jsonObjs = ex.getJSONObjects(extractedText, keywords);
            
            //jsonObjs -> List(
            //                "{ "civil status" : "married" , "skills" : [java, c] }",
            //                "{ "civil status" : "single" , "skills" : [scala, php, sql] }"
            //                )
         }
    }
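
To make the OneOf / MultipleOf semantics concrete, here is a minimal stand-alone sketch of matching a list of options against a text. It is an illustration under assumptions, not Flipper's matching logic: ChoiceMatchSketch is a hypothetical class, matching is case-insensitive on whole words, and oneOf simply returns the first option found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration of OneOf / MultipleOf style matching.
class ChoiceMatchSketch {
    // Returns every option that occurs as a whole word in the text (case-insensitive).
    // Multi-word options are not supported by this simple tokenization.
    static List<String> multipleOf(String text, List<String> options) {
        List<String> words = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
        List<String> found = new ArrayList<>();
        for (String option : options) {
            if (words.contains(option.toLowerCase(Locale.ROOT))) {
                found.add(option);
            }
        }
        return found;
    }

    // Returns the first option (in the order of `options`) found in the text, if any.
    static Optional<String> oneOf(String text, List<String> options) {
        return multipleOf(text, options).stream().findFirst();
    }
}
```

Matching on whole words instead of substrings matters for short options: with a plain substring search, "c" would be reported for any text containing "scala".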

  • Getting a Map of keywords and all the values found for them

If you want a Map from each keyword to all the values found for it, use getAllMatchedValues.

Scala

    import parser.extraction.Extractor.{readPDF, getAllMatchedValues}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number, OneOf, MultipleOf}
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)
    val keywords = Map("name"-> ProperNoun(), "age" -> Number())
    
    val matchedValues = getAllMatchedValues(extractedText, keywords) 
    //matchedValues -> Map(
    //                      "name" -> List("John Doe", "Jane Doe"),
    //                      "age" -> List("21", "22")
    //                    )

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.POSTagEnum;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.Map;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            HashMap keywords = new HashMap<>();
            keywords.put("name", SpecificationJava.postTag(POSTagEnum.PROPERNOUN));
            keywords.put("age", SpecificationJava.postTag(POSTagEnum.NUMBER));
            
            Map matchedValues = ex.getAllMatchedValues(extractedText, keywords);
            
            //matchedValues -> Map(
            //                      "name" -> List("John Doe", "Jane Doe"),
            //                      "age" -> List("21", "22")
            //                    )
         }
    }

  • Getting just a single value for each keyword

getSingleMatchedValue works exactly like getAllMatchedValues, but instead of returning every value found for a keyword it returns only one.

Scala

    import parser.extraction.Extractor.{readPDF, getSingleMatchedValue}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number, OneOf, MultipleOf}
    
    val file = new File("./path/to/pdf/document")
    val extractedText = readPDF(file)
    val keywords = Map("name"-> ProperNoun(), "age" -> Number())
    
    val matchedValues = getSingleMatchedValue(extractedText, keywords) 
    //matchedValues -> Map(
    //                      "name" -> List("John Doe"),
    //                      "age" -> List("21")
    //                     )

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.POSTagEnum;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.Map;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            HashMap keywords = new HashMap<>();
            keywords.put("name", SpecificationJava.postTag(POSTagEnum.PROPERNOUN));
            keywords.put("age", SpecificationJava.postTag(POSTagEnum.NUMBER));
            
            Map matchedValues = ex.getSingleMatchedValue(extractedText, keywords);
            
            //matchedValues -> Map(
            //                      "name" -> List("John Doe"),
            //                      "age" -> List("21")
            //                     )
         }
    }
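
The relationship between the two result shapes can be sketched as a reduction of the "all values" Map. The source does not say which value getSingleMatchedValue picks, so taking the first one is purely an assumption here, and SingleValueSketch is a hypothetical name.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: reduce "all matched values" to one value per keyword.
class SingleValueSketch {
    // Keeps only the first value of each keyword (assumed selection rule);
    // keywords without any value keep an empty list.
    static Map<String, List<String>> firstOnly(Map<String, List<String>> allValues) {
        Map<String, List<String>> single = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : allValues.entrySet()) {
            List<String> values = entry.getValue();
            single.put(entry.getKey(), values.isEmpty()
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(values.get(0)));
        }
        return single;
    }
}
```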

  • Getting all possible pre-JSON objects for the values found

getAllObjects returns a List containing all the possible pre-JSON objects (Maps of keywords to values, before they are turned into JSON) for the values found for the given keywords.

Scala

    import parser.extraction.Extractor.{readPDF, getAllObjects}
    import java.io.File
    import parser.generator.utils.{ProperNoun, Number, OneOf, MultipleOf}
    
    val filePath = new File("./path/to/pdf/document")
    val extractedText = readPDF(filePath)
    val keywords = Map("name"-> ProperNoun(), "age" -> Number())
    
    val matchedValues = getAllObjects(extractedText, keywords) 
    //matchedValues -> List(
    //                      Map("name" -> List("John Doe"), "age" -> List("21")),
    //                      Map("name" -> List("Jane Doe"), "age" -> List("22"))
    //                     )

Java

    import parser.extraction.ExtractorJava;
    import java.io.File;
    import parser.generator.utils.POSTagEnum;
    import parser.generator.utils.SpecificationJava;
    import java.util.HashMap;
    import java.util.List;
    
    public class Example {
         public static void main(String[] args) {
            ExtractorJava ex = new ExtractorJava();
            File file = new File("./path/to/pdf/document");
            String extractedText = ex.readPDF(file);
            
            HashMap keywords = new HashMap<>();
            keywords.put("name", SpecificationJava.postTag(POSTagEnum.PROPERNOUN));
            keywords.put("age", SpecificationJava.postTag(POSTagEnum.NUMBER));
            
            List matchedValues = ex.getAllObjects(extractedText, keywords);
            
            //matchedValues -> List(
            //                      Map("name" -> List("John Doe"), "age" -> List("21")),
            //                      Map("name" -> List("Jane Doe"), "age" -> List("22"))
            //                     )
         }
    }
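
One way to picture how such pre-JSON objects relate to the Map returned by getAllMatchedValues is to group the i-th value of every keyword into the i-th object. That grouping rule is an assumption made for this sketch, and PreJsonSketch and zipByIndex are hypothetical names, not Flipper's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: build one object per index from per-keyword value lists.
class PreJsonSketch {
    // The i-th object holds the i-th value of every keyword; keywords that have
    // fewer values get an empty list in the remaining objects.
    static List<Map<String, List<String>>> zipByIndex(Map<String, List<String>> allValues) {
        int objectCount = 0;
        for (List<String> values : allValues.values()) {
            objectCount = Math.max(objectCount, values.size());
        }
        List<Map<String, List<String>>> objects = new ArrayList<>();
        for (int i = 0; i < objectCount; i++) {
            Map<String, List<String>> object = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> entry : allValues.entrySet()) {
                List<String> values = entry.getValue();
                object.put(entry.getKey(), i < values.size()
                        ? Collections.singletonList(values.get(i))
                        : Collections.<String>emptyList());
            }
            objects.add(object);
        }
        return objects;
    }
}
```

Given "name" -> ["John Doe", "Jane Doe"] and "age" -> ["21", "22"], this yields two objects shaped like the ones in the example comments above.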