Valid N-gram & POS Service Documentation

This is an API service that extracts valid n-grams (n = 1, 2, 3, or 4) from a set of words or sentences. The service can also assign the most frequent POS (part-of-speech) tag to each n-gram extracted from the given input.

This service can be useful for:

  • Valid N-gram Extraction
  • POS Assignment
  • Verification
  • Filtering Sentences
  • Checking Validity of Words

Languages Supported

  • JavaScript
  • Python
  • PHP

Why Python?

N-gram Extraction and Verification

The user can provide eleven inputs (param1 through param4 are counted together as one "parameter" group):

  • content
  • content_type
  • delimiter
  • verify_for
  • method
  • parameter
    • param1
    • param2
    • param3
    • param4
  • ngram_n
  • ngram_n_max
  • source_id
  • language
  • email

{
  "content" : "animalia is a book;boy",
  "content_type": "ngram",
  "delimiter": ";",
  "verify_for" : "ngram",
  "method" : "",
  "param1" : ,
  "ngram_n": 2,
  "ngram_n_max": 3,
  "source_id" : 1,
  "language" : "en",
  "email" : "user@gmail.com"
}

The main service URL: https://ngrampos.vipresearch.ca/ngram_pos/service/word_service/

content : The sentence, single word, or group of words to process. Multiple items can be separated by an optional delimiter.

content_type : Two options are available: "ngram" and "pos".

delimeter : (Optional) The separator between the sentences or words. If it is not assigned, a space is used as the default delimiter.

verify_for : (Optional) If "ngram" is selected, the service returns results for the n-grams. If "pos" is selected, it returns results for the POS tags and their details. If "both" is selected, it returns results for both n-grams and POS tags. The default value is "ngram".

method : (Optional) When verifying with POS tags, the tags are sorted by either the "top" or the "doc-based-snlp" method. If this argument is not provided, the service uses "top" by default.

parameter (not an input itself; used here only to group param1 through param4) : (Optional) When sorting POS tags with "top" or "doc-based-snlp", the sorting is controlled by the parameters described below.

param1 : (Optional) Used with method "top"; leave it empty or unassigned when the method is "doc-based-snlp". It accepts any positive integer x, in which case the top x POS tags are used to verify the given content. The default value is 15.

param2 : (Optional) Set Collection. Used with method "doc-based-snlp"; leave it empty or unassigned when the method is "top". It accepts one of two values, "union" and "intersect". The default value is "union". Currently, the only available value is "union".

param3 : (Optional) Top POS Number. Used with method "doc-based-snlp"; leave it empty or unassigned when the method is "top". It accepts one of three values: 5, 15, and 50. The default value is 15.

param4 : (Optional) Standard Deviation Value. Used with method "doc-based-snlp"; leave it empty or unassigned when the method is "top". It accepts values between 1.5 and 6.5 in steps of 0.5, that is 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5. The default value is 1.5.

ngram_n : The specific n-gram size the user wants extracted from the sentences. It accepts an integer between 1 and 4 (1, 2, 3, or 4).

ngram_n_max : All n-grams from 1 up to the value provided will be extracted and processed.
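
To make the notion of word n-grams concrete, here is a minimal client-side sketch that enumerates contiguous word n-grams for a range of sizes. It only illustrates the concept; how the service itself tokenizes the content is not documented here. The function name word_ngrams and the choice to treat the delimiter as a plain word boundary are assumptions made for this example.

```python
def word_ngrams(text, n_min, n_max, delimiter=" "):
    """Return all contiguous word n-grams of text for n_min <= n <= n_max."""
    # Assumption: the delimiter simply marks a word boundary.
    if delimiter:
        text = text.replace(delimiter, " ")
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


candidates = word_ngrams("animalia is a book;boy", 2, 3, delimiter=";")
print(len(candidates))
print(candidates)
```

For the sample content "animalia is a book;boy" with sizes 2 to 3, this sketch produces seven candidates, which matches the "amount" reported in the sample n-gram response in this document.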

source_id : (Optional) A source id number: 1, 2, or 3. The default value is 1.

  • 1: DBpedia long abstracts
  • 2: Google Books
  • 3: DBpedia labels

language : (Optional) The language code of the n-grams you wish to search for, e.g. "fr" for French. The default value is "en" for English.

email : (Optional) Please provide an email address with every request so that we are able to optimize the service and potentially send updates on major service changes.
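
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: build_payload and ALLOWED_PARAM4 are hypothetical names, and the rule that empty strings count as "unassigned" is an assumption. Only the value ranges stated in this document are enforced.

```python
# Allowed param4 values as documented: 1.5, 2.0, ..., 6.5
ALLOWED_PARAM4 = [1.5 + 0.5 * i for i in range(11)]

CHECKS = {
    "content_type": lambda v: v in ("ngram", "pos"),
    "verify_for": lambda v: v in ("ngram", "pos", "both"),
    "method": lambda v: v in ("top", "doc-based-snlp"),
    "param1": lambda v: isinstance(v, int) and v > 0,
    "param2": lambda v: v in ("union", "intersect"),
    "param3": lambda v: v in (5, 15, 50),
    "param4": lambda v: v in ALLOWED_PARAM4,
    "ngram_n": lambda v: v in (1, 2, 3, 4),
    "source_id": lambda v: v in (1, 2, 3),
}


def build_payload(content, **options):
    """Build a request dict and check it against the documented ranges."""
    payload = {"content": content}
    # Assumption: None and "" both mean "left unassigned".
    payload.update({k: v for k, v in options.items() if v not in (None, "")})

    for key, is_valid in CHECKS.items():
        if key in payload and not is_valid(payload[key]):
            raise ValueError(f"invalid value for {key}: {payload[key]!r}")

    # param1 belongs to "top"; param2..param4 belong to "doc-based-snlp".
    method = payload.get("method", "top")
    if method == "top" and any(k in payload for k in ("param2", "param3", "param4")):
        raise ValueError('param2, param3 and param4 require method "doc-based-snlp"')
    if method == "doc-based-snlp" and "param1" in payload:
        raise ValueError('param1 requires method "top"')
    return payload


example = build_payload(
    "a book", content_type="ngram", method="doc-based-snlp", param3=15, param4=2
)
print(example)
```

Keys that have no documented range (for example language or email) are passed through unchanged.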

The example request shown earlier (with verify_for set to "ngram") returns a response like:

[
  {
    "ngram_asked": {
      "amount": 7,
      "valid": {
        "a book": [ { "valid_ngram": "True", "pos": "DT-NN", "ngram": 2 } ]
      },
      "invalid": {
        "animalia is": [ { "valid_ngram": "False", "pos_tag": "NN-VBZ", "ngram": 2 } ],
        "is a": [ { "valid_ngram": "False", "pos_tag": "VBZ-DT", "ngram": 2 } ],
        "book boy": [ { "valid_ngram": "False", "pos_tag": "NN-NN", "ngram": 2 } ],
        "animalia is a": [ { "valid_ngram": "False", "pos_tag": "NN-VBZ-DT", "ngram": 3 } ],
        "is a book": [ { "valid_ngram": "False", "pos_tag": "VBZ-DT-NN", "ngram": 3 } ],
        "a book boy": [ { "valid_ngram": "False", "pos_tag": "DT-NN-NN", "ngram": 3 } ]
      },
      "validity": ["False","False","True","False","False","False","False"],
      "valid_ngram": ["a book"],
      "valid_ngram_n": [2],
      "valid_ngram_pos": ["DT-NN"],
      "invalid_ngram_pos": ["NN-VBZ","VBZ-DT","NN-VBZ-DT","VBZ-DT-NN","DT-NN-NN"],
      "invalid_ngram": ["animalia is","is a","book boy","animalia is a","is a book","a book boy"],
      "invalid_ngram_n": [2,2,2,3,3,3]
    }
  }
]
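
A response of this shape can be reduced to a simple mapping from each n-gram to a boolean. The sketch below works on a shortened copy of the sample response; ngram_validity is a hypothetical helper name, and the code assumes every n-gram maps to a list with one entry whose "valid_ngram" field is the string "True" or "False", as in the sample.

```python
import json

# Shortened copy of the sample n-gram response from this document.
sample_response = json.loads("""
[ { "ngram_asked": {
      "amount": 3,
      "valid": {
        "a book": [ { "valid_ngram": "True", "pos": "DT-NN", "ngram": 2 } ]
      },
      "invalid": {
        "animalia is": [ { "valid_ngram": "False", "pos_tag": "NN-VBZ", "ngram": 2 } ],
        "is a": [ { "valid_ngram": "False", "pos_tag": "VBZ-DT", "ngram": 2 } ]
      }
} } ]
""")


def ngram_validity(response):
    """Map each n-gram in the response to True (valid) or False (invalid)."""
    asked = response[0]["ngram_asked"]
    verdicts = {}
    for section in ("valid", "invalid"):
        for ngram, entries in asked.get(section, {}).items():
            # The flag arrives as a string, so compare against "True".
            verdicts[ngram] = entries[0]["valid_ngram"] == "True"
    return verdicts


print(ngram_validity(sample_response))
```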

If verify_for is set to "pos", we obtain the following results:

[
  {
    "pos_asked": {
      "invalid": {
        "animalia is": [ { "valid_pos": "True", "pos_tag": "NN-VBZ", "Full-form": "Noun (singular)-Verb (3rd-Person singular present)", "pos_frequency": "7159", "ngram": 2 } ],
        "is a": [ { "valid_pos": "True", "pos_tag": "VBZ-DT", "Full-form": "Verb (3rd-Person singular present)-Determiner", "pos_frequency": "494", "ngram": 2 } ],
        "animalia is a": [ { "valid_pos": "True", "pos_tag": "NN-VBZ-DT", "Full-form": "Noun (singular)-Verb (3rd-Person singular present)-Determiner", "pos_frequency": "Unknown", "ngram": 3 } ],
        "is a book": [ { "valid_pos": "True", "pos_tag": "VBZ-DT-NN", "Full-form": "Verb (3rd-Person singular present)-Determiner-Noun (singular)", "pos_frequency": "Unknown", "ngram": 3 } ],
        "a book boy": [ { "valid_pos": "True", "pos_tag": "DT-NN-NN", "Full-form": "Determiner-Noun (singular)-Noun (singular)", "pos_frequency": "Unknown", "ngram": 3 } ]
      },
      "valid": {
        "a book": [ { "valid_pos": "True", "pos_tag": "DT-NN", "Full-form": "Determiner-Noun (singular)", "pos_frequency": 18236, "ngram": 2 } ],
        "book boy": [ { "valid_pos": "True", "pos_tag": "NN-NN", "Full-form": "Noun (singular)-Noun (singular)", "pos_frequency": 51192, "ngram": 2 } ]
      },
      "amount": 7,
      "valid_pos": ["DT-NN","NN-NN"],
      "invalid_pos": ["NN-VBZ","VBZ-DT","NN-VBZ-DT","VBZ-DT-NN","DT-NN-NN"]
    }
  }
]
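
In the sample above, "pos_frequency" appears as a number, as a numeric string, and as the string "Unknown". A client may want to normalize it before use. The following is a minimal sketch on a shortened copy of the sample; pos_details and to_frequency are hypothetical helper names, and mapping "Unknown" to None is a choice made for this example.

```python
import json

# Shortened copy of the sample POS response from this document.
sample_response = json.loads("""
[ { "pos_asked": {
      "invalid": {
        "animalia is": [ { "valid_pos": "True", "pos_tag": "NN-VBZ",
          "Full-form": "Noun (singular)-Verb (3rd-Person singular present)",
          "pos_frequency": "7159", "ngram": 2 } ],
        "animalia is a": [ { "valid_pos": "True", "pos_tag": "NN-VBZ-DT",
          "Full-form": "Noun (singular)-Verb (3rd-Person singular present)-Determiner",
          "pos_frequency": "Unknown", "ngram": 3 } ]
      },
      "valid": {
        "a book": [ { "valid_pos": "True", "pos_tag": "DT-NN",
          "Full-form": "Determiner-Noun (singular)",
          "pos_frequency": 18236, "ngram": 2 } ]
      }
} } ]
""")


def to_frequency(raw):
    """Convert a frequency to int; return None when it is not numeric."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def pos_details(response):
    """Collect tag, full form, frequency and section for each n-gram."""
    asked = response[0]["pos_asked"]
    details = {}
    for section in ("valid", "invalid"):
        for ngram, entries in asked.get(section, {}).items():
            entry = entries[0]
            details[ngram] = {
                "pos_tag": entry["pos_tag"],
                "full_form": entry["Full-form"],
                "frequency": to_frequency(entry["pos_frequency"]),
                "section": section,
            }
    return details


details = pos_details(sample_response)
print(details["a book"])
```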

Parts of Speech Validity

When content_type is set to "pos", the service returns information about the given part-of-speech tags.

The JSON result carries the following information for each tag:

  • Abbreviation
  • Frequency
  • Validity

An example request:

{ "content" : "NN,JJ-IN", "delimeter": ",", "content_type": "pos" }

The JSON response to the above API call looks like:

[
  {
    "pos": {
      "amount": 2,
      "valid": {
        "NN": [ { "valid_pos": "True", "Full-form": "Noun (singular)", "pos_frequency": 59685 } ]
      },
      "invalid": {
        "JJ-IN": [ { "valid_pos": "False", "Full-form": "Adjective-Preposition", "pos_frequency": "1284" } ]
      },
      "validity": [ "True", "False" ],
      "valid_pos": [ "NN" ],
      "invalid_pos": [ "JJ-IN" ]
    }
  }
]

The response shows the validity of each POS tag requested by the user, which can be helpful for checking grammatical correctness.
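
As a small illustration, the "validity" list can be paired with the tags that were submitted. This sketch assumes that the list follows the order of the tags in the request content, which holds for the sample above but is not stated explicitly; tag_validity is a hypothetical helper name.

```python
def tag_validity(content, delimiter, response):
    """Pair each submitted POS tag with its validity flag as a boolean."""
    tags = content.split(delimiter)
    flags = response[0]["pos"]["validity"]
    # Assumption: one flag per tag, in the same order as the request.
    if len(tags) != len(flags):
        raise ValueError("number of tags and validity flags differ")
    return {tag: flag == "True" for tag, flag in zip(tags, flags)}


# Relevant part of the sample response from this document.
sample_response = [
    {"pos": {"amount": 2,
             "validity": ["True", "False"],
             "valid_pos": ["NN"],
             "invalid_pos": ["JJ-IN"]}}
]

result = tag_validity("NN,JJ-IN", ",", sample_response)
print(result)
```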

Retrieving Top POS

This service returns the 15 currently most frequent part-of-speech tags and presents the response in JSON or CSV format, depending on the user's preference.

URL : https://ngrampos.vipresearch.ca/ngram_pos/service/get_list.php

format : Both JSON and CSV output formats are supported. The default is JSON.

The JSON request string should look like:

{"format":"json"}

The output in JSON:

{
  "NN": [ { "Abbrevation": "NN", "pos_count": 51402 } ],
  "NN-NN": [ { "Abbrevation": "NN-NN", "pos_count": 41403 } ],
  "JJ-NN": [ { "Abbrevation": "JJ-NN", "pos_count": 38855 } ],
  "NNP-NNP": [ { "Abbrevation": "NNP-NNP", "pos_count": 27907 } ],
  .......... ......
}

To request CSV output instead:

{"format":"csv"}

The output in CSV:

NN,NN-NN,JJ-NN,NNP-NNP,NN-IN,DT-JJ-NN,IN-DT-NN,NNP-NN,DT-NN,NNP-NNP-NNP,JJ-NN-IN,JJ-NNS,IN-NNP,NN-IN-DT,NN-NNS
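
Both output formats are easy to turn into Python structures. The following is a minimal sketch using only the standard library and the sample outputs shown in this section; parse_top_pos_csv and parse_top_pos_json are hypothetical helper names, and the JSON sample is limited to the four entries listed above.

```python
import csv
import io
import json


def parse_top_pos_csv(text):
    """Return the POS tags from a single CSV line, in the order given."""
    row = next(csv.reader(io.StringIO(text)), [])
    # Strip stray spaces around tags and drop empty fields.
    return [tag.strip() for tag in row if tag.strip()]


def parse_top_pos_json(text):
    """Return (tag, count) pairs sorted by count, highest first."""
    data = json.loads(text)
    counts = {tag: entries[0]["pos_count"] for tag, entries in data.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


csv_sample = ("NN,NN-NN,JJ-NN,NNP-NNP,NN-IN,DT-JJ-NN,IN-DT-NN,NNP-NN,DT-NN,"
              "NNP-NNP-NNP,JJ-NN-IN, JJ-NNS,IN-NNP,NN-IN-DT,NN-NNS")
json_sample = """
{ "NN": [ { "Abbrevation": "NN", "pos_count": 51402 } ],
  "NN-NN": [ { "Abbrevation": "NN-NN", "pos_count": 41403 } ],
  "JJ-NN": [ { "Abbrevation": "JJ-NN", "pos_count": 38855 } ],
  "NNP-NNP": [ { "Abbrevation": "NNP-NNP", "pos_count": 27907 } ] }
"""

top_tags = parse_top_pos_csv(csv_sample)
top_counts = parse_top_pos_json(json_sample)
print(len(top_tags), top_counts[0])
```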

Accessing API/Implementing in Code

In PHP:

Users can call this API in the format below:

$input_arr = array(
    'content' => "dogs are wonderful;enjoyable",
    'content_type' => 'ngram',
    'delimeter' => ";",
    'method' => 'doc-based-snlp',
    'param2' => 'union',
    'param3' => 10,
    // param4 is omitted, so it defaults to 1.5
    'ngram_n' => 1,
    'verify_for' => 'ngram'
);
$json = json_encode($input_arr);

// Send the JSON body in an HTTP POST request
$context = stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-Type: application/json',
        'content' => $json
    )
));

// Use file_get_contents (or curl) and json_decode to capture the response
$url = "https://ngrampos.vipresearch.ca/ngram_pos/service/word_service/";
$contents = file_get_contents($url, false, $context);
$response = json_decode($contents, true);

In Python:

Users can call this API in the format below:

import requests

parameters = {
    "content": "stones are hard;cake",
    "content_type": "ngram",
    "delimeter": ";",
    "method": "top",
    "param1": "50",
    "verify_for": "ngram",
    "ngram_n": 1,
}
url = "https://ngrampos.vipresearch.ca/ngram_pos/service/word_service/"

# The parameters are sent as a JSON body
r = requests.get(url, json=parameters)
print(r.json())
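
If the requests package is not available, a similar call can be sketched with the Python standard library. The example below mirrors the PHP version by sending the JSON body with POST; that the service treats POST and GET alike is an assumption. build_request and call_service are hypothetical helper names, and the 30-second timeout is an arbitrary choice. Building the request is kept separate from sending it so the payload can be inspected without network access.

```python
import json
import urllib.request

SERVICE_URL = "https://ngrampos.vipresearch.ca/ngram_pos/service/word_service/"


def build_request(payload, url=SERVICE_URL):
    """Create a POST request carrying the payload as a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_service(payload, timeout=30):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request(
    {"content": "stones are hard;cake", "content_type": "ngram", "ngram_n": 1}
)
print(request.get_method(), request.full_url)
```

Calling call_service with the same dictionary would perform the actual request and return the decoded response.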