
Gary J. Espitia S.


About Me

Physician with an MSc in bioinformatics. Junior-level experience with statistics and machine learning methods, and comfortable with multi-modal data manipulation. Proficient in the GNU/Linux environment, with experience in bioinformatic and scientometric research tools. Particularly interested in machine learning, data science, network analysis, and 3D protein bioinformatics tools. Looking for opportunities in hospitals and industry related to data science and machine learning applied to the health sciences.

Skills & Interests

  • Bioinformatics
  • Scientometric tools
  • Epigenetics
  • AI in Healthcare
  • Data Visualization
  • Academic Research
  • Open Source Software
  • Scientific Communication

Activity Calendar

[Contribution-style calendar of post activity across the months of the year, with a Less → More intensity legend]

Content Analysis (Zipf's Law)

Word Frequency Distribution

Zipf's law, named after linguist George Kingsley Zipf (1902-1950), states that the frequency of any word is inversely proportional to its rank in the frequency table. For example, if the most common word occurs n times, the second most common occurs n/2 times, the third most common n/3 times, etc.

Mathematically expressed as: f(r) ∝ 1/r^α, where f(r) is the frequency of the word with rank r, and the exponent α is close to 1.
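As a quick sketch, the ideal Zipf frequencies used for the dashed reference line can be derived directly from the most frequent word's count (the numbers below are made up for illustration):

```python
def zipf_ideal(top_freq, n_ranks, alpha=1.0):
    """Ideal Zipf frequency for rank r: f(r) = top_freq / r**alpha."""
    return [top_freq / r**alpha for r in range(1, n_ranks + 1)]

# With alpha = 1, the second word occurs half as often as the first,
# the third a third as often, and so on.
print(zipf_ideal(120, 4))  # [120.0, 60.0, 40.0, 30.0]
```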

This visualization compares the actual vocabulary distribution (blue dots) against the ideal Zipf's Law distribution (dashed line). The phenomenon appears not only in language but across many natural and social systems, reflecting organizational principles of human behavior and information.

About the Cleaned Corpus

The "cleaned corpus" refers to the collection of words processed through several cleaning steps:

  1. Words are extracted from all posts (titles, descriptions, tags, categories)
  2. All words are converted to lowercase
  3. Punctuation and special characters are removed
  4. Very short words (2 characters or fewer) are filtered out
  5. Common stop words like "a", "an", "the", "and", etc. are removed

This cleaning process is important because it removes noise that would skew the frequency analysis, normalizes the text so that variants of a word are counted together, and excludes common words that occur frequently but carry little meaning.
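The cleaning steps above can be sketched in a few lines of Python (the stop-word set here is a small hypothetical subset; the real list is much longer):

```python
import re

# Hypothetical stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in"}

def clean_corpus(texts):
    """Lowercase, strip punctuation, drop words of 2 characters
    or fewer, and remove stop words."""
    words = []
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) > 2 and token not in STOP_WORDS:
                words.append(token)
    return words

print(clean_corpus(["The Zipf's Law, and an AI demo!"]))
# ['zipf', 'law', 'demo']
```

Note that the regular expression doubles as punctuation removal: anything that is not a run of letters is simply never emitted as a token.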

References:

  • Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
  • Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
  • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Jäger, G. (2012). Power laws and other heavy-tailed distributions in linguistic typology. Advances in Complex Systems, 15(3).
  • Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.

Word Frequency Analysis

Table columns: Rank · Word · Frequency (n) · Proportion (%) · Ideal Zipf frequency
Source: Analysis of content from titles, descriptions, tags, and categories across all posts in this knowledge base.

Topic Analysis (LDA)

Topic Distribution

Topic Details


This feature requires modern browser support and may not work on all devices.

This visualization uses Latent Dirichlet Allocation (LDA) to discover topics in the content. The analysis is performed using GPU acceleration via TensorFlow.js for optimal performance.

On mobile: tap topics for details, use the toggle button to switch between chart and table views.