본문 바로가기

빅데이터

스트림 프로세싱 with Faust - Windows

Faust는 Windowing을 쉽게 지원할 수 있는 기능을 가지고 있다. Windowing을 통해 지난 10분간 데이터의 분석 혹은 매 5분마다 1시간 간격의 데이터 분석과 같은 내용을 수행할 수 있다. 이번 포스팅에서는 Windowing을 사용하는 방법에 대해 알아보고자 한다.

 

Faust는 Hopping, Tumbling window를 지원한다.

Tumbling Windows

Tumbling window는 중복되지 않은 데이터에 대해 일정간격으로 분석할때 사용된다.

from dataclasses import asdict, dataclass
from datetime import timedelta
import json
import random

import faust


@dataclass
class ClickEvent(faust.Record):
    email: str
    timestamp: str
    uri: str
    number: int


app = faust.App("exercise7", broker="kafka://localhost:9092")
clickevents_topic = app.topic("com.udacity.streams.clickevents", value_type=ClickEvent)

uri_summary_table = app.Table("uri_summary", default=int).tumbling(
    timedelta(seconds=10)
) # Tumbling window 설정, 간격:10초


@app.agent(clickevents_topic)
async def clickevent(clickevents):
    async for ce in clickevents.group_by(ClickEvent.uri):
        uri_summary_table[ce.uri] += ce.number
        print(f"{ce.uri}: {uri_summary_table[ce.uri].current()}")


if __name__ == "__main__":
    app.main()

위 코드는 'com.udacity.streams.clickevents' topic에서 나온 데이터를 기준으로 10초 간격동안 uri의 group by당 개수를 uri_summary_table에 등록하여 print를 수행한다.

root@2f57ea8fd2c8:/home/workspace# python exercise6.7.solution.py worker
┌ƒaµS† v1.7.4─┬──────────────────────────────────────────┐
│ id          │ exercise7                                │
│ transport   │ [URL('kafka://localhost:9092')]          │
│ store       │ memory:                                  │
│ web         │ http://localhost:6066/                   │
│ log         │ -stderr- (warn)                          │
│ pid         │ 815                                      │
│ hostname    │ 2f57ea8fd2c8                             │
│ platform    │ CPython 3.7.3 (Linux x86_64)             │
│ drivers     │                                          │
│   transport │ aiokafka=1.0.6                           │
│   web       │ aiohttp=3.6.2                            │
│ datadir     │ /home/workspace/exercise7-data           │
│ appdir      │ /home/workspace/exercise7-data/v1        │
└─────────────┴──────────────────────────────────────────┘
starting➢ ◣
 😊
[2019-11-22 01:31:40,569: WARNING]: https://www.smith.com/main/tags/home/: 783 
[2019-11-22 01:31:40,625: WARNING]: https://robinson.com/app/search.php: 148 
[2019-11-22 01:31:40,627: WARNING]: http://martinez.com/tags/terms/: 267 
[2019-11-22 01:31:40,639: WARNING]: https://www.gomez.com/blog/category/main/register.html: 753 
[2019-11-22 01:31:40,641: WARNING]: https://sullivan-hamilton.com/tags/category/: 489 
[2019-11-22 01:31:40,642: WARNING]: https://eaton-butler.com/blog/wp-content/tags/about.php: 989 
[2019-11-22 01:31:40,643: WARNING]: https://weaver.com/post/: 281 
[2019-11-22 01:31:40,644: WARNING]: https://marquez.com/faq/: 202 
[2019-11-22 01:31:40,646: WARNING]: https://bradley.com/main.htm: 237 
[2019-11-22 01:31:40,646: WARNING]: https://rios.com/main/tags/tag/privacy.html: 177 
[2019-11-22 01:31:40,647: WARNING]: https://www.robinson-combs.com/: 327 
[2019-11-22 01:31:40,648: WARNING]: https://west.com/register.html: 771 
[2019-11-22 01:31:40,649: WARNING]: http://reyes.org/: 296 
[2019-11-22 01:31:40,653: WARNING]: http://www.mcdaniel.com/categories/wp-content/post.php: 662 
[2019-11-22 01:31:40,664: WARNING]: http://oliver-williams.com/author/: 718 
[2019-11-22 01:31:40,665: WARNING]: https://www.martinez-haney.com/index.asp: 15 
[2019-11-22 01:31:40,667: WARNING]: http://www.alvarez-smith.info/wp-content/terms.php: 517 
[2019-11-22 01:31:40,668: WARNING]: https://www.garcia-smith.com/tag/categories/post/: 839 
[2019-11-22 01:31:40,670: WARNING]: https://petersen.biz/: 199 
[2019-11-22 01:31:40,671: WARNING]: https://guzman-zimmerman.com/posts/posts/search/homepage.jsp: 847 
[2019-11-22 01:31:40,672: WARNING]: https://thomas.com/wp-content/blog/post/: 842 
[2019-11-22 01:31:40,673: WARNING]: http://clark.net/: 599 
[2019-11-22 01:31:40,674: WARNING]: http://www.mcclure.info/: 929 
[2019-11-22 01:31:40,675: WARNING]: http://barrett-martin.net/category/search/main/about/: 402 
[2019-11-22 01:31:40,676: WARNING]: https://www.farmer-deleon.com/posts/main/terms/: 202 
[2019-11-22 01:31:40,677: WARNING]: https://thomas.com/wp-content/blog/post/: 1748 
[2019-11-22 01:31:40,678: WARNING]: https://larson.info/faq.htm: 661 
[2019-11-22 01:31:40,678: WARNING]: https://kline.biz/main.htm: 7 
[2019-11-22 01:31:40,759: WARNING]: https://nichols-bailey.info/tags/blog/author/: 44 
[2019-11-22 01:31:40,762: WARNING]: https://www.jones-gomez.biz/homepage/: 39 
[2019-11-22 01:31:40,762: WARNING]: https://hood.com/search.html: 807 
[2019-11-22 01:31:40,763: WARNING]: http://johnson.com/tag/explore/category/: 414 
[2019-11-22 01:31:40,764: WARNING]: https://www.davis.info/register.asp: 677 
[2019-11-22 01:31:40,766: WARNING]: http://www.wright.net/register/: 291 
[2019-11-22 01:31:40,767: WARNING]: https://www.allen.info/blog/main/main/: 724 
[2019-11-22 01:31:40,768: WARNING]: http://wilkins.com/: 242 
[2019-11-22 01:31:40,769: WARNING]: http://www.moran-french.org/terms/: 357 
[2019-11-22 01:31:40,770: WARNING]: https://bailey.org/: 124 
[2019-11-22 01:31:40,771: WARNING]: http://chavez-thompson.info/main/: 190 
[2019-11-22 01:31:40,773: WARNING]: http://barrett-martin.net/category/search/main/about/: 1221 
[2019-11-22 01:31:40,774: WARNING]: https://www.lopez.net/category/index/: 959 
[2019-11-22 01:31:40,775: WARNING]: https://www.price.org/list/categories/blog/terms.htm: 682 
[2019-11-22 01:31:40,776: WARNING]: http://www.spears.net/home.html: 424 

 

Hopping Windows

Hopping window는 지난 x시간 동안의 데이터를 y간격을 반복하여 데이터를 분석하는 것이다. y간격은 x시간보다 길 수 없으며 일부 데이터는 2개의 window간에 중복되어 처리된다. 

from dataclasses import asdict, dataclass
from datetime import timedelta
import json
import random

import faust


@dataclass
class ClickEvent(faust.Record):
    email: str
    timestamp: str
    uri: str
    number: int


app = faust.App("exercise8", broker="kafka://localhost:9092")
clickevents_topic = app.topic("com.udacity.streams.clickevents", value_type=ClickEvent)

uri_summary_table = app.Table("uri_summary", default=int).hopping(
    size=timedelta(minutes=1),
    step=timedelta(seconds=10)
) # hopping window선언을 위한 size와 step


@app.agent(clickevents_topic)
async def clickevent(clickevents):
    async for ce in clickevents.group_by(ClickEvent.uri):
        uri_summary_table[ce.uri] += ce.number
        print(f"{ce.uri}: {uri_summary_table[ce.uri].current()}")


if __name__ == "__main__":
    app.main()

위 코드는 10초 간격으로 1분 동안의 데이터에 대해 분석하는 코드이다.

root@68f789ccd9f1:/home/workspace# python exercise6.8.solution.py worker
┌ƒaµS† v1.7.4─┬──────────────────────────────────────────┐
│ id          │ exercise8                                │
│ transport   │ [URL('kafka://localhost:9092')]          │
│ store       │ memory:                                  │
│ web         │ http://localhost:6066/                   │
│ log         │ -stderr- (warn)                          │
│ pid         │ 818                                      │
│ hostname    │ 68f789ccd9f1                             │
│ platform    │ CPython 3.7.3 (Linux x86_64)             │
│ drivers     │                                          │
│   transport │ aiokafka=1.0.6                           │
│   web       │ aiohttp=3.6.2                            │
│ datadir     │ /home/workspace/exercise8-data           │
│ appdir      │ /home/workspace/exercise8-data/v1        │
└─────────────┴──────────────────────────────────────────┘
starting➢ ◠
 😊
[2019-11-22 01:38:32,252: WARNING]: https://wise.com/: 953 
[2019-11-22 01:38:32,317: WARNING]: https://www.coleman-edwards.com/: 214 
[2019-11-22 01:38:32,321: WARNING]: http://www.calderon.com/index/: 559 
[2019-11-22 01:38:32,323: WARNING]: https://khan-butler.info/app/about.html: 938 
[2019-11-22 01:38:32,326: WARNING]: http://jackson.org/category/index/: 739 
[2019-11-22 01:38:32,335: WARNING]: http://www.benton.com/explore/privacy.html: 377 
[2019-11-22 01:38:32,340: WARNING]: http://herman.biz/search/: 871 
[2019-11-22 01:38:32,342: WARNING]: https://burch-holt.com/post/: 335 
[2019-11-22 01:38:32,344: WARNING]: https://www.molina-rivera.com/: 198 
[2019-11-22 01:38:32,346: WARNING]: http://www.maldonado.info/main.php: 462 
[2019-11-22 01:38:32,348: WARNING]: http://www.patterson.com/app/list/register.asp: 542 
[2019-11-22 01:38:32,352: WARNING]: https://www.west.info/main/categories/posts/faq.htm: 676 
[2019-11-22 01:38:32,355: WARNING]: http://white-miller.com/search/homepage/: 259 
[2019-11-22 01:38:32,357: WARNING]: https://smith.com/login/: 498 
[2019-11-22 01:38:32,360: WARNING]: http://mays.com/: 715 
[2019-11-22 01:38:32,361: WARNING]: https://roberts.com/login.htm: 446 
[2019-11-22 01:38:32,363: WARNING]: https://everett-lee.com/author/: 857 
[2019-11-22 01:38:32,365: WARNING]: https://cruz.com/wp-content/app/login/: 250 
[2019-11-22 01:38:32,368: WARNING]: http://www.lee.com/privacy.htm: 862 
[2019-11-22 01:38:32,370: WARNING]: https://wallace.com/post.php: 456 
[2019-11-22 01:38:32,372: WARNING]: https://www.atkins.com/search.jsp: 995 
[2019-11-22 01:38:32,374: WARNING]: https://www.chan.com/search/main/blog/category/: 659 
[2019-11-22 01:38:32,375: WARNING]: https://www.wright.com/category/login/: 461 
[2019-11-22 01:38:32,377: WARNING]: https://martinez.biz/home/: 877 
[2019-11-22 01:38:32,378: WARNING]: http://www.keith.com/terms/: 873 
[2019-11-22 01:38:32,380: WARNING]: http://www.hamilton-lee.net/: 735 
[2019-11-22 01:38:32,383: WARNING]: http://rice.org/index/: 946 
[2019-11-22 01:38:32,385: WARNING]: http://www.carpenter.com/categories/blog/posts/author/: 497 
[2019-11-22 01:38:32,387: WARNING]: https://chang.com/blog/explore/blog/terms/: 861 
[2019-11-22 01:38:32,388: WARNING]: https://adams-brown.biz/explore/tag/tag/post.php: 293 
[2019-11-22 01:38:32,390: WARNING]: https://ayala.com/: 622 
[2019-11-22 01:38:32,392: WARNING]: https://www.wilson-harris.com/category/wp-content/terms.php: 790 
[2019-11-22 01:38:32,395: WARNING]: https://erickson-hamilton.com/login.htm: 405 
[2019-11-22 01:38:32,396: WARNING]: http://www.herrera-navarro.info/explore/explore/login/: 234 
[2019-11-22 01:38:32,400: WARNING]: https://www.miller.com/homepage/: 45 
[2019-11-22 01:38:32,402: WARNING]: https://www.morales.com/search.htm: 196 
[2019-11-22 01:38:32,404: WARNING]: http://hardin.org/index.htm: 314 
[2019-11-22 01:38:32,406: WARNING]: http://www.davenport-ryan.com/: 976 
[2019-11-22 01:38:32,407: WARNING]: https://moore.info/main/wp-content/categories/search.html: 231 
[2019-11-22 01:38:32,409: WARNING]: http://ross-gutierrez.org/: 569 
[2019-11-22 01:38:32,410: WARNING]: https://jordan-oliver.net/tags/search/search/about.html: 895 
[2019-11-22 01:38:32,412: WARNING]: http://www.smith.info/search/wp-content/category/privacy.php: 488 
[2019-11-22 01:38:32,415: WARNING]: http://www.garcia.com/: 984 
[2019-11-22 01:38:32,416: WARNING]: http://ibarra-reese.com/blog/wp-content/posts/homepage/: 964 
[2019-11-22 01:38:32,418: WARNING]: http://www.hopkins.com/: 946 
[2019-11-22 01:38:32,420: WARNING]: http://www.stephens-barnes.biz/about/: 389 
[2019-11-22 01:38:32,421: WARNING]: https://www.molina-rivera.com/: 762 
[2019-11-22 01:38:32,422: WARNING]: http://www.reynolds.com/search/app/app/index/: 141 
[2019-11-22 01:38:32,425: WARNING]: https://www.dyer.info/explore/search/main/main/: 23 
[2019-11-22 01:38:32,426: WARNING]: http://www.white-carpenter.com/explore/main/posts/home/: 627 
[2019-11-22 01:38:32,428: WARNING]: https://juarez.com/categories/main/: 796 
[2019-11-22 01:38:32,431: WARNING]: https://www.robles.com/app/homepage.html: 22 
[2019-11-22 01:38:32,437: WARNING]: https://www.cline-wilson.biz/main.html: 909 
[2019-11-22 01:38:32,448: WARNING]: http://www.fuentes-holland.com/blog/category/blog/category/: 811 
[2019-11-22 01:38:32,450: WARNING]: http://www.maldonado.info/main.php: 595 
[2019-11-22 01:38:32,452: WARNING]: http://bell.com/explore/posts/wp-content/category/: 261 

 


태그