UnB Saxion Aalborg Minho

EGP UnB

Project Management Office of the University of Brasília

Erasmus+

Introduction to the Natural Language Processing System and Hierarchical Document Classification

In a high-turnover academic environment, PDF repositories accumulate documents with inconsistent formatting, images, tables, and noise that make automated processing difficult. To address this, we implemented an AI pipeline that combines advanced OCR, structured extraction, and semantic normalization, removing inconsistencies and transforming each document into clean, hierarchical text. The cleaned content is then sent to our LLM, ensuring:

  • Improved accuracy in semantic search;
  • Generation of summaries and contextual tags;
  • Data consistency for analysis and decision-making.

Context and Objectives

Scope

The diagram below illustrates the data flow selected for implementing the prototype.

EGP UnB information flowchart
  • PDF file repository: 89 files, 82,340 words
  • No structural standardization: files from different deliverables produced by the 4 participating universities.

Objectives

Read the files sequentially (PDF, XLSX, Markdown, CSV), clean their content, and apply dimensionality reduction in order to build prompts specialized for the domain of the uploaded files.

  • Preserve academic history: Centralize and organize all versions of requirements, documents, and previous deliverables.
  • Efficient onboarding: Enable new participants to access the full project context in seconds, without relying exclusively on senior members.
  • Knowledge continuity: Ensure that decisions and lessons learned from past semesters are easy to retrieve and reuse.
  • Smart automation: Use AI to automatically index, classify, and hierarchically organize documents in multiple formats (PDF, Excel, Markdown).
  • Reduced rework: Minimize duplicated effort when searching for and interpreting artifacts from previous semesters.
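
The sequential reading step described in the objectives could be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `read_file` and `read_repository` are hypothetical, and only formats readable with the Python standard library (Markdown, plain text, CSV) are handled. PDF and XLSX files are merely reported as skipped, on the assumption that an OCR or spreadsheet library would process them in the real pipeline.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Extensions this sketch reads as plain text (an assumption, not the project's list).
TEXT_EXTENSIONS = {".md", ".markdown", ".txt"}


def read_file(path: Path) -> Optional[str]:
    """Return the text of one file, or None if the format needs an external parser."""
    suffix = path.suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8", errors="replace") as handle:
            # Flatten each row into one line of space-separated cells.
            return "\n".join(" ".join(row) for row in csv.reader(handle))
    return None  # e.g. .pdf or .xlsx, left to an OCR or spreadsheet library


def read_repository(root: Path) -> Tuple[Dict[str, str], List[str]]:
    """Read files one at a time in sorted order; return (texts by name, skipped names)."""
    texts: Dict[str, str] = {}
    skipped: List[str] = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        content = read_file(path)
        if content is None:
            skipped.append(path.name)
        else:
            texts[path.name] = content
    return texts, skipped
```

Keeping the list of skipped files makes it visible which documents still need a dedicated extractor before they can reach the cleaning stage.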

Method

The diagram below illustrates the processing flow: static analysis of word frequency within the files using natural language processing, followed by dimensionality reduction, vectorization, clustering, and level-based classification, and finally feeding the LLM agent with the generated domain data.

EGP UnB information flowchart
  • Input: 89 files, 82,340 words
  • Output: 89 summarized files, each accompanied by 6 descriptive key phrases.
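
The word-frequency and vectorization stages of this flow can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the stopword list is a tiny placeholder (the project's curated list is not shown), the names `tokenize` and `tf_idf` are hypothetical, and plain unsmoothed TF-IDF stands in for whatever vectorization the prototype actually uses.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list; a placeholder for the project's curated list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on"}


def tokenize(text: str) -> List[str]:
    """Lowercase, keep runs of letters, drop stopwords and tokens shorter than 3 letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def tf_idf(documents: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Map each document name to a sparse {term: weight} TF-IDF vector."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(documents)
    vectors: Dict[str, Dict[str, float]] = {}
    for name, c in counts.items():
        total = sum(c.values()) or 1
        vectors[name] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
    return vectors
```

With this weighting, a term that occurs in every document gets weight 0, so vocabulary shared across all deliverables does not dominate the vectors passed on to dimensionality reduction and clustering.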

Results

The flowchart below describes the natural language processing, dimensionality reduction, vectorization, clustering, and file classification steps that generate the summaries and other inputs for an LLM-based AI assistant.

  • File Classification Algorithm: classifies folders and files.
    EGP UnB information flowchart
  • Assisted Stopwords Cleaning Algorithm: assisted learning algorithm for cleaning stopwords.
    EGP UnB information flowchart
  • Restricted-domain exploration: feeding is performed through keyword frequency mapping.
    EGP UnB information flowchart

    K-Means = 3: PCA 3D, t-SNE, UMAP, Metrics

    K-Means = 4: PCA 3D, t-SNE, UMAP, Metrics

    K-Means = 5: PCA 3D, t-SNE, UMAP, Metrics

    K-Means = 7: PCA 3D, t-SNE, UMAP, Metrics

    K-Means = 8: PCA 3D, t-SNE, UMAP, Metrics

    K-Means = 9: PCA 3D, t-SNE, UMAP, Metrics

    Full exploration

    EGP UnB information flowchart
  • Smart automation: Generation of summaries and classification key phrases to feed a file database for Wiki search
  • ALPHA.

    Id: 0

    Name: AAU - Mobile Education - 2021 - Final Report - Secure Software Development, Web Security, Injection Attacks & Taint Analysis

    Summary: The content addresses secure software development, including vulnerability analysis, cyber and injection attacks, and emphasizes the importance of information security and secure development. It also highlights initiatives focused on mobile education for vulnerable communities, such as building a digital platform for financial education for recyclable material waste pickers.

    Labels:
    - Secure software development and vulnerability analysis
    - Secure software development and cyberattacks
    - Secure software development and mobile education for vulnerable communities
    - Secure software development and injection attacks
    - Development of a digital platform for financial education for recyclable material waste pickers
    - Information security and secure development

    Wiki Database
  • Restricted-domain chatbot: Confine the LLM’s operation to a clean, restricted data domain.
  • COMING SOON.
    EGP UnB information flowchart
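
The K-Means exploration listed in the results (one run per cluster count) can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions: `k_means` and `sweep` are hypothetical names, the deterministic farthest-point initialization is chosen only to make the example reproducible, and inertia (the within-cluster sum of squared distances) stands in for the metrics panels, whose actual contents are not given in the text.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Sequence[float]


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _init_centers(points: List[Point], k: int) -> List[List[float]]:
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(_dist2(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def k_means(points: List[Point], k: int, max_iter: int = 100) -> Tuple[List[int], float]:
    """Lloyd's algorithm; returns (cluster label per point, inertia)."""
    centers = _init_centers(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(_dist2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def sweep(points: List[Point], ks: Iterable[int]) -> Dict[int, float]:
    """Run k-means once per candidate k and collect the inertia of each run."""
    return {k: k_means(points, k)[1] for k in ks}
```

A call such as `sweep(vectors, [3, 4, 5, 7, 8, 9])` would mirror the cluster counts explored above, assuming `vectors` is a list of equal-length numeric document vectors. Inertia generally falls as k grows, so it is usually read alongside a second criterion rather than minimized on its own.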

Next Steps

  • Develop and validate an agent performance evaluation system.
  • Test scientifically described NLP processes and analyze results.
  • Test scientifically described normalization processes and analyze results.
  • Test dimensionality reduction processes and analyze results.
  • Test vectorization processes and analyze results.
  • Test dimensionality plotting procedures and analyze results.
  • Test clustering processes and analyze results.
  • Test LLM feeding flows and analyze results.
  • Evaluate different LLM models and compare performance.
  • Draft a compliance checklist for Ready to Use Software Product (RUSP).

References

  • Official Educado Documentation (MkDocs)
  • Jurafsky, D.; Martin, J. H., Speech and Language Processing, 3rd Edition (online draft), 2021. (Introduction to NLP techniques)
  • MacQueen, J. B., “Some Methods for Classification and Analysis of Multivariate Observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967. (Original k-means paper)
  • Brown, T. B.; Mann, B.; Ryder, N.; et al., “Language Models are Few-Shot Learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020. (Modern LLMs)