Laurel Hart

Computational linguist — Software engineer

New NLP-ish Repo

October 28, 2014

Curious about what I worked on over the summer? Check out my new, elegantly-named repo:

I decided to open source my work because a) my bosses had no problem with it (i.e., “because I can”) and b) I believe there are parts which may be more widely useful, such as:

  • GigawordCorpusHandler, which parses the SGML structure used by the LDC's Gigaword Corpus (consistent across English, Spanish, and Chinese);
  • CoreNLPProcessor, which uses Stanford's CoreNLP to perform tokenization, NER, and POS annotation (although not in the most efficient way... more on that later); and
  • Utils, which provides a function that checks whether a file is gzipped and returns the appropriate InputStream (useful for working with large corpora like Gigaword without having to gzip every file).
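
To give a flavor of the gzip check, here is a minimal sketch of one way such a helper could work. It is not the code from the repo: the class name GzipAwareStreams and the method open are hypothetical, and the approach (peeking at the two gzip magic bytes, 0x1f 0x8b, before deciding how to wrap the stream) is an assumption about a reasonable implementation.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, for illustration only; not the repo's actual Utils class.
public class GzipAwareStreams {

    /** Opens a file, transparently decompressing it if it is gzipped. */
    public static InputStream open(Path file) throws IOException {
        return open(Files.newInputStream(file));
    }

    /** Wraps a stream in a GZIPInputStream if it starts with the gzip magic bytes. */
    public static InputStream open(InputStream raw) throws IOException {
        // BufferedInputStream supports mark/reset, so we can peek without consuming.
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset();

        // Every gzip stream begins with the bytes 0x1f 0x8b.
        boolean gzipped = (first == 0x1f && second == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

With something like this, the calling code can treat a compressed Gigaword file and an uncompressed one the same way, since both come back as an ordinary InputStream.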

In the coming months I’ll likely work on breaking them out into their own useful little repos, but since the code runs, I might as well make it available now. Enjoy!