|
Notes
| Description of the dataset
The texts originate from two corpora published by TASS (Workshop on Semantic Analysis at SEPLN):
(1) InterTASS. 3,413 tweets (years 2016–2017), each containing at least one adjective and more than three words, partitioned into training (C_Train), development (C_Dev), and test (C_Test) sets; the test set comprises 1,899 tweets.
(2) General corpus. 57,832 tweets (years 2011–2012) covering multiple topics (e.g., politics, economics, communication, and culture). The original six‑class annotation was reduced to four classes by merging P+ with P and N+ with N.
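The six-to-four reduction described above can be sketched as a simple lookup table. This is a minimal illustration, not the corpus authors' own code; the names `SIX_TO_FOUR` and `to_four_classes` are hypothetical, and the label spellings are assumed to match those used in this description.

```python
# Hypothetical mapping: collapse the six original polarity labels into four.
# P+ is merged into P and N+ into N; NEU and NONE are left unchanged.
SIX_TO_FOUR = {
    "P+": "P",
    "P": "P",
    "NEU": "NEU",
    "N": "N",
    "N+": "N",
    "NONE": "NONE",
}


def to_four_classes(label: str) -> str:
    """Return the four-class label for one of the six original labels."""
    return SIX_TO_FOUR[label]


# Example: remap a small list of original annotations.
original = ["P+", "N", "NONE", "N+", "NEU"]
remapped = [to_four_classes(label) for label in original]
print(remapped)
```

An unknown label raises a `KeyError` here, which makes unexpected annotation values visible instead of passing them through silently.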
Class distribution:
(1) In InterTASS, the training and development sets are imbalanced, with N and P predominant, and the C_Test set (1,899 tweets) has a similar profile.
(2) In the General corpus (after mapping to four classes): P = 21,262 (36.77%), NEU = 1,300 (2.25%), N = 15,124 (26.15%), NONE = 20,146 (34.84%); the pronounced underrepresentation of NEU (<3%) should be taken into account in analysis and model evaluation.
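The General corpus shares can be recomputed from the per-class counts as a quick consistency check. The snippet below is a minimal sketch that uses only the counts stated above; the variable names are illustrative assumptions.

```python
# Per-class counts for the General corpus after the six-to-four mapping,
# copied from the description above.
counts = {"P": 21262, "NEU": 1300, "N": 15124, "NONE": 20146}

# The counts should add up to the stated corpus size of 57,832 tweets.
total = sum(counts.values())

# Share of each class as a percentage of the whole corpus.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]:,} tweets ({share:.2f}%)")

# Ratio between the largest and smallest class, a rough imbalance indicator.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

With these counts, P outnumbers NEU by roughly 16 to 1, which is why per-class metrics may be more informative than overall accuracy when evaluating on this corpus.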
Labels and partitions:
The dataset adopts four document-level labels (P, NEU, N, NONE). For reproducibility, the C_Train, C_Dev, and C_Test partitions from InterTASS are preserved, as is the complete General corpus with the six‑to‑four class mapping. |