The IMDB movie reviews dataset is a standard collection for the binary sentiment analysis task and is widely used for benchmarking sentiment analysis algorithms. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available. Both raw text and an already processed bag-of-words format are provided.
The reviews are structured in the following levels:
documents, sentences, words/tokens, characters
The data is divided into 5 parts, each of which can be accessed by passing one of the following keywords:
train_pos: positive polarity sentiment train set examples (default)
train_neg: negative polarity sentiment train set examples
test_pos: positive polarity sentiment test set examples
test_neg: negative polarity sentiment test set examples
train_unsup: unlabeled examples
To get rid of unwanted levels, use the flatten_levels function from MultiResolutionIterators.jl.
Example:
# Using the "test_neg" keyword for negative test set examples
julia> using CorpusLoaders
julia> using Base.Iterators # provides take, used below
julia> dataset_test_neg = load(IMDB("test_neg"))
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> docs = collect(take(dataset_test_neg, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Once", "again", "Mr.", "Costner", "has", "dragged", "out", "a", "movie", "for", "far", "longer", "than", "necessary", "."], ["Aside", "from", "the", "terrific", "sea", "rescue", "sequences", ",", "of", "which" … "just", "did", "not", "care", "about", "any", "of", "the", "characters", "."], ["Most", "of", "us", "have", "ghosts", "in", "the", "closet", ",", "and" … "later", ",", "by", "which", "time", "I", "did", "not", "care", "."], ["The", "character", "we", "should", "really", "care", "about", "is", "a", "very", "cocky", ",", "overconfident", "Ashton", "Kutcher", "."], ["The", "problem", "is", "he", "comes", "off", "as", "kid", "who", "thinks" … "him", "and", "shows", "no", "signs", "of", "a", "cluttered", "closet", "."], ["His", "only", "obstacle", "appears", "to", "be", "winning", "over", "Costner", "."], ["Finally", "when", "we", "are", "well", "past", "the", "half", "way", "point" … ",", "Costner", "tells", "us", "all", "about", "Kutcher", "'s", "ghosts", "."], ["We", "are", "told", "why", "Kutcher", "is", "driven", "to", "be", "the", "best", "with", "no", "prior", "inkling", "or", "foreshadowing", "."], ["No", "magic", "here", ",", "it", "was", "all", "I", "could", "do", "to", "keep", "from", "turning", "it", "off", "an", "hour", "in", "."]]
[["This", "is", "an", "example", "of", "why", "the", "majority", "of", "action", "films", "are", "the", "same", "."], ["Generic", "and", "boring", ",", "there", "'s", "really", "nothing", "worth", "watching", "here", "."], ["A", "complete", "waste", "of", "the", "then", "barely-tapped", "talents", "of", "Ice-T" … "they", "are", "capable", "of", "acting", ",", "and", "acting", "well", "."], ["Do", "n't", "bother", "with", "this", "one", ",", "go", "see", "New" … "Friday", "for", "Ice", "Cube", "and", "see", "the", "real", "deal", "."], ["Ice-T", "'s", "horribly", "cliched", "dialogue", "alone", "makes", "this", "film", "grate" … "the", "heck", "Bill", "Paxton", "was", "doing", "in", "this", "film", "?"], ["And", "why", "the", "heck", "does", "he", "always", "play", "the", "exact", "same", "character", "?"], ["From", "Aliens", "onward", ",", "every", "film", "I", "'ve", "seen", "with" … "><br", "/", ">Overall", ",", "this", "is", "second-rate", "action", "trash", "."], ["There", "are", "countless", "better", "films", "to", "see", ",", "and", "if" … "copy", "but", "has", "better", "acting", "and", "a", "better", "script", "."], ["The", "only", "thing", "that", "made", "this", "at", "all", "worth", "watching" … "for", "the", "horrible", "film", "itself", "-", "but", "not", "quite", "."], ["4", "/", "10", "."]]
# Using the "train_pos" keyword for positive train set examples
julia> dataset_train_pos = load(IMDB()) # no need to specify the category because "train_pos" is the default
Channel{Array{Array{String,1},1}}(sz_max:4,sz_curr:4)
julia> using Base.Iterators
julia> docs = collect(take(dataset_train_pos, 2))
2-element Array{Array{Array{String,1},1},1}:
[["Bromwell", "High", "is", "a", "cartoon", "comedy", "."], ["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."], ["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."], ["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."], ["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."], ["High", "."], ["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."], ["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."], ["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."], ["What", "a", "pity", "that", "it", "is", "n't", "!"]]
[["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."], ["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"], ["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."], ["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."], ["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."], ["They", "'re", "survivors", "."], ["Bolt", "is", "n't", "."], ["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."], ["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]]
julia> using MultiResolutionIterators # provides flatten_levels
julia> flatten_levels(docs, lvls(IMDB, :documents)) |> full_consolidate
19-element Array{Array{String,1},1}:
["Bromwell", "High", "is", "a", "cartoon", "comedy", "."]
["It", "ran", "at", "the", "same", "time", "as", "some", "other", "programs", "about", "school", "life", ",", "such", "as", "``", "Teachers", "''", "."]
["My", "35", "years", "in", "the", "teaching", "profession", "lead", "me", "to" … "much", "closer", "to", "reality", "than", "is", "``", "Teachers", "''", "."]
["The", "scramble", "to", "survive", "financially", ",", "the", "insightful", "students", "who" … "me", "of", "the", "schools", "I", "knew", "and", "their", "students", "."]
["When", "I", "saw", "the", "episode", "in", "which", "a", "student", "repeatedly" … "immediately", "recalled", "...", "...", "...", "at", "...", "...", "...", "."]
["High", "."]
["A", "classic", "line", ":", "INSPECTOR", ":", "I", "'m", "here", "to", "sack", "one", "of", "your", "teachers", "."]
["STUDENT", ":", "Welcome", "to", "Bromwell", "High", "."]
["I", "expect", "that", "many", "adults", "of", "my", "age", "think", "that", "Bromwell", "High", "is", "far", "fetched", "."]
["What", "a", "pity", "that", "it", "is", "n't", "!"]
["Homelessness", "(", "or", "Houselessness", "as", "George", "Carlin", "stated", ")", "has" … "school", ",", "work", ",", "or", "vote", "for", "the", "matter", "."]
["Most", "people", "think", "of", "the", "homeless", "as", "just", "a", "lost" … "to", "see", "what", "it", "'s", "like", "to", "be", "homeless", "?"]
["That", "is", "Goddard", "Bolt", "'s", "lesson.<br", "/", "><br", "/", ">Mel" … "wants", "with", "a", "future", "project", "of", "making", "more", "buildings", "."]
["The", "bet", "'s", "on", "where", "Bolt", "is", "thrown", "on", "the" … "move", "where", "he", "ca", "n't", "step", "off", "the", "sidewalk", "."]
["He", "'s", "given", "the", "nickname", "Pepto", "by", "a", "vagrant", "after" … "Wilson", ")", "who", "are", "already", "used", "to", "the", "streets", "."]
["They", "'re", "survivors", "."]
["Bolt", "is", "n't", "."]
["He", "'s", "not", "used", "to", "reaching", "mutual", "agreements", "like", "he" … "do", "n't", "know", "what", "to", "do", "with", "their", "money", "."]
["Maybe", "they", "should", "give", "it", "to", "the", "homeless", "instead", "of" … "maybe", "this", "film", "will", "inspire", "you", "to", "help", "others", "."]
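To make the idea of removing a level more concrete, here is a minimal plain-Julia sketch of the same reshaping on a toy nested array. It does not use flatten_levels and is not how MultiResolutionIterators.jl implements it; the toy data and the variable names (sentences, doc_tokens, tokens) are assumptions made purely for illustration.

```julia
# Toy stand-in for two loaded reviews: documents -> sentences -> tokens.
# The data is made up for illustration; it is not taken from the dataset.
docs = [
    [["Great", "film", "."], ["Loved", "it", "!"]],
    [["Too", "long", "."]],
]

# Dropping the documents level: concatenate the sentence lists of all documents,
# which mirrors the shape of the flatten_levels output shown above.
sentences = reduce(vcat, docs)

# Dropping the sentences level instead: concatenate the tokens within each document.
doc_tokens = [reduce(vcat, doc) for doc in docs]

# Dropping both levels: one flat vector of tokens.
tokens = reduce(vcat, sentences)

println(length(sentences))
println(length.(doc_tokens))
println(length(tokens))
```

Which level to drop depends on the task: a bag-of-words classifier typically wants one token list per document, while sentence-level models would keep the sentences and drop the documents level instead.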