The goal is to create a believable President Barack Obama speech generator in R, using a selection of his speeches, addresses, and interviews as training data.
✭ Generating a speech by randomly sampling blocks of text from the provided documents is, technically, a valid approach (albeit not an impressive one), but each block must be no longer than one sentence.
To train your algorithms, we have prepared 8 text documents containing a selection of Obama’s words from various speeches and interviews. The documents are available as an 82KB ZIP archive.
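As a reference point, the sentence-sampling baseline mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the text is a short stand-in (three sentences borrowed from the example output in this announcement) rather than one of the provided files, the splitting rule is a naive regular expression that will mis-split abbreviations such as "U.S.", and `sample_sentences` is a hypothetical name.

```r
# Stand-in text; a real script would read the provided documents instead.
text <- paste(
  "There are only so many shortcuts.",
  "Ultimately, we have to change the law.",
  "And people have to remain focused on that."
)

# Naive sentence splitter (assumption): break on whitespace that follows
# sentence-ending punctuation.
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# Baseline generator: draw n whole sentences at random, with replacement,
# so no sampled block is ever longer than one sentence.
sample_sentences <- function(n) sample(sentences, n, replace = TRUE)

set.seed(1)
sample_sentences(2)
```

Note that this version still depends on the global `sentences` vector, so it would not survive the clean-up of the global environment that the rules describe.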
Each submission must be emailed to Mikhail Popov (mikhail@mpopov.com) by 11:59pm on May 20th as a single R script that, when run, will use the documents we provided (above) to create a self-contained function speech. This function must generate an n-sentence-long Obama speech.
speech must not rely on any objects outside of it. All objects that are not speech will be removed from the global environment before speech is used to generate text. See instructions below for creating a self-contained function.
speech(3)
## [1] "There are only so many shortcuts."
## [2] "Ultimately, we have to change the law."
## [3] "And people have to remain focused on that."
The submissions will be evaluated by the Pittsburgh useR group organizers (for code readability, performance, and memory usage) and competing teams (blind peer review).
Code Readability (5%)
All submissions will be uploaded to the Pittsburgh useR group repository on GitHub for group members to learn from, so formatting and commenting are important. See Hadley Wickham’s style guide for suggestions on writing readable R code.
Note: participants may request to have their submission published anonymously.
Training Performance (15%)
We will measure the speed of importing the provided documents, cleaning up and manipulating the data, and training the algorithm.
Memory Usage (15%)
A more compact speech object will score higher.
Data Generative Performance (15%)
We will measure the speed of generating a speech.
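Participants who want a rough self-check of the measured criteria above (training speed, object size, generation speed) could try something like the following. This is a sketch under assumptions, not the organizers' procedure, which the announcement does not specify: `make_speech` is a hypothetical stand-in for a real training step, the corpus is a toy one, and the serialized length is used as a size proxy because `object.size()` does not include the environment attached to a function.

```r
# Hypothetical stand-in for a real submission's training step.
make_speech <- function(sentences) {
  force(sentences)
  function(n = 1) sample(sentences, n, replace = TRUE)
}

# Toy corpus: three example sentences, repeated to give it some bulk.
corpus <- rep(c(
  "There are only so many shortcuts.",
  "Ultimately, we have to change the law.",
  "And people have to remain focused on that."
), times = 1000)

# Training time: how long it takes to build the speech function.
train_time <- system.time(speech <- make_speech(corpus))

# Size proxy (assumption): bytes needed to serialize the closure,
# which includes the data captured in its enclosing environment.
size_bytes <- length(serialize(speech, NULL))

# Generation time for a 100-sentence speech.
gen_time <- system.time(out <- speech(100))

train_time[["elapsed"]]
size_bytes
gen_time[["elapsed"]]
```

Timings from a single run are noisy, so repeating each measurement several times gives a more reliable picture.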
Blind Peer Review (50%)
The most important deciding factor will be the competing teams themselves. We will generate a short speech using the same seed for each group. Each generated text will be run through a text-to-speech engine and saved as a randomly numbered mp3 file. The audio files will be sent out to the teams, who will rank them from least believable to most believable without knowing which audio file belongs to which team.
A one (1) year subscription to shinyapps.io Standard Edition (a $1,100 value): Unlimited Applications, 1,000 Active Hours, Authentication, Multiple Instances, and Email Support.
Runners-up will get a variety of prizes, including Hands-On Programming with R by Garrett Grolemund.
RStudio is a trademark of RStudio, Inc.
We received a single submission from Taylor Pospisil and Lee Richardson, which can be found at https://github.com/Pittsburgh-useR-Group/RObama/tree/master/submission.
You can use the following template and accompanying example for creating a function that contains all the data and models it needs:
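# Factory: returns a function that carries Object1 and Object2 with it.
make <- function(Object1, Object2) {
  # Force the lazily evaluated arguments now, so their values are captured
  # before the originals are removed from the global environment.
  force(Object1)
  force(Object2)
  return(function(n = NULL) {
    # Lookup starts in this function's own environment and continues into
    # the enclosing environment created by make(), where the objects live.
    obj1 <- get('Object1', environment())
    obj2 <- get('Object2', environment())
    str(obj1)
    str(obj2)
    ls()  # lists only this call's local variables
  })
}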
Let’s see it in action:
set.seed(0)
str(x <- rnorm(10))
## num [1:10] 1.263 -0.326 1.33 1.272 0.415 ...
str(y <- rnorm(10))
## num [1:10] 0.764 -0.799 -1.148 -0.289 -0.299 ...
speech <- make(x,y)
rm(x,y,make)
ls()
## [1] "speech"
speech()
## num [1:10] 1.263 -0.326 1.33 1.272 0.415 ...
## num [1:10] 0.764 -0.799 -1.148 -0.289 -0.299 ...
## [1] "n" "obj1" "obj2"
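Applied to the contest itself, the same pattern might look like the sketch below. It is a minimal illustration rather than a competitive entry: `make_speech` is a hypothetical name, the corpus is a three-sentence stand-in instead of the provided documents, and the "model" is plain sentence sampling. It also shows that fixing the seed makes the generated speech repeatable.

```r
# Hypothetical factory following the template above: the returned function
# keeps `sentences` in its enclosing environment.
make_speech <- function(sentences) {
  force(sentences)
  function(n = 1) sample(sentences, n, replace = TRUE)
}

# Stand-in corpus; a real script would build this from the provided files.
corpus <- c(
  "There are only so many shortcuts.",
  "Ultimately, we have to change the law.",
  "And people have to remain focused on that."
)

speech <- make_speech(corpus)
rm(corpus, make_speech)  # speech must keep working without these

# Same seed, same speech.
set.seed(2015)
first <- speech(3)
set.seed(2015)
second <- speech(3)
identical(first, second)
```

Because the inner function refers to `sentences` directly, R finds it in the enclosing environment through ordinary lexical scoping, so the explicit `get()` calls from the template are optional here.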