
Gary J. Espitia S.
About Me
Skills & Interests
- Bioinformatics
- Scientometric tools
- Epigenetics
- AI in Healthcare
- Data Visualization
- Academic Research
- Open Source Software
- Scientific Communication
Content Analysis (Zipf's Law)
Word Frequency Distribution
Zipf's law, named after linguist George Kingsley Zipf (1902-1950), states that the frequency of any word is inversely proportional to its rank in the frequency table. For example, if the most common word occurs n times, the second most common occurs n/2 times, the third most common n/3 times, etc.
Mathematically, this is expressed as f(r) ∝ 1/r^α, where f(r) is the frequency of the word with rank r and α is an exponent close to 1.
This visualization compares the actual vocabulary distribution (blue dots) against the ideal Zipf's Law distribution (dashed line). The phenomenon appears not only in language but across many natural and social systems, reflecting organizational principles of human behavior and information.
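As a rough illustration of how the observed points and the ideal curve can be computed, the following Python sketch ranks words by frequency and compares each observed count with the ideal value f(1)/r, assuming α = 1. The token list is hypothetical and simply stands in for the cleaned corpus described below.

```python
from collections import Counter

# Hypothetical token list; in practice this comes from the cleaned corpus.
tokens = ["data", "analysis", "data", "science", "data", "topic", "analysis"]

counts = Counter(tokens)
ranked = counts.most_common()      # [(word, frequency), ...] sorted by frequency
total = sum(counts.values())
top_freq = ranked[0][1]            # frequency of the rank-1 word

for rank, (word, freq) in enumerate(ranked, start=1):
    ideal = top_freq / rank        # ideal Zipf frequency: f(1) / r, with alpha = 1
    print(f"{rank:>4} {word:<12} observed={freq:<4} "
          f"p={100 * freq / total:5.2f}%  ideal={ideal:.2f}")
```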
About the Cleaned Corpus
The "cleaned corpus" refers to the collection of words processed through several cleaning steps:
- Words are extracted from all posts (titles, descriptions, tags, categories)
- All words are converted to lowercase
- Punctuation and special characters are removed
- Very short words (2 characters or less) are filtered out
- Common stop words like "a", "an", "the", "and", etc. are removed
This cleaning is important: it removes noise that would skew the frequency analysis, normalizes the text so that case variants of the same word are counted together, and excludes common words that occur frequently but carry little meaning.
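A minimal Python sketch of such a cleaning pipeline is shown below; the stop-word list and the example post fields are illustrative, not the exact ones used on this site.

```python
import re

# Illustrative stop-word list; the real list used on the site is larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "for"}

def clean(text: str) -> list[str]:
    """Lowercase, strip punctuation, and drop short words and stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation / special characters
    words = text.split()
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]

# Example: tokens gathered from a post's title, description, tags, and categories.
post_fields = ["AI in Healthcare: a short note", "epigenetics, data visualization"]
corpus = [token for field in post_fields for token in clean(field)]
print(corpus)   # ['healthcare', 'short', 'note', 'epigenetics', 'data', 'visualization']
```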
References:
- Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
- Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Jäger, G. (2012). Power laws and other heavy-tailed distributions in linguistic typology. Advances in Complex Systems, 15(3).
- Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.
Word Frequency Analysis
| Rank | Word | Frequency (n) | Probability (%) | Ideal Zipf Frequency |
| --- | --- | --- | --- | --- |
Topic Analysis (LDA)
Topic Distribution
Topic Details
This visualization uses Latent Dirichlet Allocation (LDA) to discover topics in the content. The analysis runs with GPU acceleration via TensorFlow.js to keep the computation fast in the browser.
On mobile: tap topics for details, use the toggle button to switch between chart and table views.
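The site's own implementation relies on TensorFlow.js and is not reproduced here; as a rough illustration of the technique itself, the following Python sketch fits an LDA model with scikit-learn on a small hypothetical corpus and prints the top words of each topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus standing in for the cleaned post text.
documents = [
    "bioinformatics epigenetics gene expression data",
    "machine learning healthcare diagnosis model",
    "data visualization scientometric publication analysis",
    "epigenetics methylation bioinformatics pipeline",
]

# Bag-of-words document-term matrix.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)

# Fit LDA with a small, assumed number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)   # per-document topic distribution

# Show the highest-weighted words for each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {k}: {', '.join(top)}")
```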