Skip to content

Latest commit

 

History

History
77 lines (64 loc) · 1.74 KB

WikiCorpus.md

File metadata and controls

77 lines (64 loc) · 1.74 KB

WikiCorpus

Very commonly used corpus in general. The loader (and default datadep) is for Samuel Reese's 2006 based corpus. The only real feature we rely on the input having is the <doc title="DocTitle"..> tags separating the documents. So any corpus following close-enough to that should work.

We capture a lot of structure. Document, section, paragraph/line, sentence, word. Note that paragraph/line level does not differentiate between a paragraph of prose, vs a line in a list.

Most users are not going to be wanting that level of structure, so should use flatten_levels (from MultiResolutionIterators.jl) to get rid of levels they don't want.

Example:

julia> using CorpusLoaders;
julia> using MultiResolutionIterators;
julia> using Base.Iterators;

julia> corpus_gen = load(WikiCorpus())
Channel{CorpusLoaders.Document{Array{Array{Array{Array{InternedStrings.InternedString,1},1},1},1},InternedStrings.InternedString}}(sz_max:4,sz_curr:4)

julia> subcorpus = collect(take(corpus_gen, 5));

julia> title.(subcorpus)
5-element Array{InternedStrings.InternedString,1}:
 "Henry Hallam"
 "Sungai Besi LRT station"
 "1808 in poetry"
 "3 Flies Up"
 "Sterling College (Vermont)"

julia> MultiResolutionIterators.levelname_map(WikiCorpus)
8-element Array{Pair{Symbol,Int64},1}:
 :doc=>1
 :section=>2
 :para=>3
 :line=>3
 :sent=>4
 :word=>5
 :token=>5
 :char=>6

julia> flatten_levels(subcorpus, (!lvls)(WikiCorpus, :word)) |> full_consolidate #Lets just get a series of words
27655-element Array{InternedStrings.InternedString,1}:
 "Henry"
 "Hallam"
 "("
 "July"
 "9"
 ","
 "1777"
 "-"
 "January"
 "21"
 ","
 "1859"
 ")"
 "was"
 "an"
 "English"
 "historian"
 
 "``"
 "I"
 "'m"
 "So"
 "Bad"
 "''"
 "3:49"
 ";"