Training/Testing Script Descriptions
A set of R scripts were developed in order to be able to process the signatures present in an article that is coming from the web application. Below you can find a description of these scripts.
Process Web Article
The script Process_Web_Article.R
is the main script of this process and is called directly by the Web Application. It receives the processid
of the article to process and then it runs the following 6 steps:
- It calculates the
focus_name
's of the signatures in the article (by using the fuzzystrmatch contrib library of PostgreSQL) and it adds them to the tablemain.articles_authors
. - It calculates the LDA Topic by calling the script
Calculate_LDA_topic.R
- It calculates the possible Ethnicities of the last names by calling the script
Calculate_Ethnicity.R
. - It aggregates all the information required to calculate the distances features (title, the keywords, the journals referenced, the subjects, the coauthors and the ethnicities) and then it stores them in the
main.info_for_distances
table. - It calculates the distances features by calling the script
Calculate_Distances.R
. - It predicts which signatures belong to the article's authors by calling the script
Predict_Equal_Authors.R
.
LDA Topic Modeling
The script Calculate_LDA_topic.R
was developed to calculate the LDA Topics of the articles. Using the topicmodels
R library, and the previously created LDA model (located in Models\LDA_model.rds
), this script calculates the LDA topics of all the articles in the database that have no LDA topic yet. The predicted topics are then saved in the main.lda_topic
table within the database.
Ethnicity Prediction for Authors' Last Names
The script Calculate_Ethnicity.R
is responsible for calculating the possible ethinicities for the last names of the different articles' signatures. With the ngram
R library along with the library stringdist
to calculate the phonetics, the script uses the ethnic_*.R
models (located in the Models\
folder) to calculate these possible ethnicities the scripts create the different models (one per ethnicity) which are stored in the Models
folder. The results of this process are then stored in the main.last_name_ethnicities
table within the database.
Calculate Distances Between Pairs of Signatures
The script Calculate_Distances.R
calculates the distances features between the incoming signatures and every other signature in its respective focus name. Using the R library vegan
the Jaccard distance is calculated between the title (using each word), the keywords (using each word), the journals referenced, the subjects (using each word), the coauthors (using the focus names) and the ethnicities (using each ethnicity value), which are retrieved form the main.info_for_distances
table in the database. The results of this process are then stored in the different distances tables located in the distances
schema of the database.
Predict Signatures that Belongs to the Same Author
The script Predict_Equal_Authors.R
is responsible for which signatures already present in the database belong to the same author as the signatures of the article received. This script uses the Logistic Regression Model (located in Models\model_cvglm.rds
) to predict the signatures of the same authors received (using the glmnet
R library). The results of this process are then stored in the main.same_authors
table within the database.